
The WWW and The PageRank Related Problems


DOCUMENT INFORMATION

Basic information

Pages: 81
File size: 12.71 MB

Content

HANOI NATIONAL UNIVERSITY
UNIVERSITY OF SCIENCE

Nguyen Hoai Nam

WWW AND THE PAGERANK-RELATED PROBLEMS

Major: Mathematical assurances for computers and computing systems
Code: 1.01.07

MASTER THESIS

THESIS SUPERVISOR: ASSOC. PROF. HA QUANG THUY

Hanoi - 2006

Contents

List of Figures
List of Tables
Introduction
Acknowledgement
Abstract
List of Glossaries

1 Objects' ranks and applications to WWW
  1.1 Rank of objects
    1.1.1 Rank for objects different from web page
    1.1.2 Linkage usage in search engine
  1.2 Basis of PageRank
    1.2.1 Mathematical basis
    1.2.2 Practical issues
  Conclusion of chapter

2 Some PageRank-related problems
  2.1 Accelerating problems
    2.1.1 Related works
    2.1.2 Exploiting block structure of the Web
  2.2 Connected-component PageRank approach
    2.2.1 Initial ideas
    2.2.2 Mathematical basis of CCP
    2.2.3 On practical side of CCP
  2.3 Spam and spam detection
    2.3.1 Introduction to Web spam
    2.3.2 Spam techniques
  Conclusion of chapter

3 Implementations and experimental results
  3.1 Search engine Nutch
  3.2 Experiments
    3.2.1 Infrastructure and data sets
    3.2.2 Results
  Conclusion of chapter

Conclusions and future works
References

List of Figures

1.1 A small directed Web graph with 7 nodes
1.2 Ranks of page in SmallWeb graph with α = .9
1.3 Figures exemplifying results with different α of SmallSet
1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet
2.1 Convergence rates for standard PageRank vs. BlockRank
2.2 Unarranged matrix and arranged matrix
2.3 Boosting techniques
3.1 Nutch packages dependencies
3.2 Matrices from set 1
3.3 Ranks from PageRank vs. CCP for set 1
3.4 Ranks and differences corresponding to each block for set 1
3.5 Time of two methods with different decay values for set 1
3.6 No. of iterations of two methods with different decay values for set 1
3.7 Matrices from set 2
3.8 Ranks from PageRank vs. CCP for set 2
3.9 Ranks and differences corresponding to each block for set 2
3.10 Time of two methods with different decay values for set 2
3.11 No. of iterations of two methods with different decay values for set 2
3.12 Matrices from set 3
3.13 Ranks from PageRank vs. CCP for set 3
3.14 Ranks and differences corresponding to each block for set 3
3.15 Time of two methods with different decay values for set 3
3.16 No. of iterations of two methods with different decay values for set 3
3.17 Mean case for 3 sets with different decay values

[...]

... to sort it in the list of pages [16, 6]. The most famous approach is PageRank™ from Google [1]. This thesis focuses on PageRank and on modifications that speed up its computation. With the aim of understanding network phenomena and, especially, PageRank computation, this thesis is organized as follows:
• Chapter 1: Presents the rank concept and PageRank. This chapter is the background...

... of the tie's loops: these are pages that can be reached from the core but not vice versa. The other loop consists of 22% of the pages that can reach the core, but cannot be reached from it. (The remaining nodes can neither reach the core nor be reached from the core.) With its own complex structure, the WWW has created many difficulties for people who aim to understand it and, furthermore, to utilize and...

Abstract

In this thesis, we study the WWW as a whole from the point of view of social networks. There are many ways to approach the WWW, but we chose social networks because this view offers many advantages for research. Whenever we hear of the WWW, we think of the Internet, the biggest network in the world. Networks are all around us, all the time. From the biochemistry...
... both depends on and influences the importance of other pages. Page and Brin [16] use the linkage structure of the Web to build a stochastic irreducible Markov chain with a transition probability matrix. We first present the simple definition of PageRank and discuss some of its computational aspects; more details on the computational aspects are given in the following chapter...

... hypertext and web mining, relational learning and inductive logic programming, and graph mining. We use the term link mining to put a special emphasis on the links, moving them up to first-class citizens in the data analysis endeavor. We now consider link mining from the point of view of its main inspiration, the WWW. There is no doubt that the Web is gigantic, and dealing with it is a challenging...

... that the size and the content of the Web are changing second by second at a dramatic growth rate. The size of the WWW is estimated to have doubled in less than two years, and this growth rate is projected to continue for the next two years. Google [1] has announced an index of over 8 billion pages, and this number does not fully reflect all the pages that exist on the...

... size and rapid change, the interlinked nature of the Web sets it apart from many other collections. Several studies aim to understand how the Web's linkage is structured and how that structure can be modeled. One recent study, for example, suggests that the link structure of the Web is somewhat like a "bow-tie" [7]. That is, about 28% of the pages constitute a strongly connected core (the center of the...

... thesis.
• What problems arise from the answers to the two previous questions? These problems cover storage capacity, computational complexity, accelerating methods, programming skills and many more.

To check whether the theory for speeding up the existing method holds, we implemented it and ran it on data sets downloaded from the Internet. Experimental figures show that our method is relatively good and...

... configuration and target price of US$500. Besides, in the store, there is only one computer with the specific configuration (c = 1, quite good) and a high price (i.e. p = 0.2). All the remaining computers have a near configuration with c = 0.8, and the price for all of them is moderate, p = 0.5. Note that, with the same or a near configuration, computers can differ in price depending on the manufacturer. Applying the definition...

... and traditional data analysis fail to cope with this level of complexity. This is what is called data mining, with Web mining as an important branch. The key inspiration of Web mining is the growth of the WWW. There is no doubt that the WWW is the most valuable source of information, but it is not easy to use if you do not know how. Several studies try to estimate the size of the Web and most of them...

... because you deserve that for your importance to me.

Hanoi, October 25th, 2006
Nguyen Hoai Nam
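The excerpt on the basis of PageRank describes a stochastic irreducible Markov chain built from the Web's link structure. As a minimal sketch of that idea, and not the thesis's own implementation, the Python below runs a plain power iteration on a small graph. The 7-node graph `toy_web` is hypothetical (the edges of the thesis's Figure 1.1 are not available in this preview), the names `pagerank`, `alpha`, `tol` and `max_iter` are illustrative, and spreading the rank of pages without out-links uniformly is one common convention assumed here. The default `alpha = 0.9` follows the value quoted in the caption of Figure 1.2.

```python
def pagerank(links, alpha=0.9, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on a small graph (illustrative sketch).

    links: dict mapping every node to the list of nodes it links to;
           every link target must also appear as a key.
    alpha: probability of following a link instead of jumping at random.
    Returns (ranks, iterations_used).
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for iteration in range(1, max_iter + 1):
        # Assumption: rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        # Each page passes its rank in equal shares along its out-links.
        for v in nodes:
            for w in links[v]:
                new_rank[w] += alpha * rank[v] / len(links[v])
        # L1 distance between successive iterates is the stopping criterion.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank, iteration


# Hypothetical 7-node web graph; page 7 has no out-links.
toy_web = {
    1: [2, 3],
    2: [3],
    3: [1],
    4: [3, 5],
    5: [4, 6],
    6: [7],
    7: [],
}

ranks, iterations = pagerank(toy_web, alpha=0.9)
```

The random jump with probability `1 - alpha` is what makes the chain irreducible in this sketch, so the iteration settles on a single rank vector whose entries sum to 1. Calling `pagerank` with several values of `alpha` and comparing `iterations` gives a rough feel for the kind of comparison listed in Figure 1.4, without reproducing the thesis's data.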

Posted: 26/07/2014, 08:17
