Algorithms of the Intelligent Web

brief contents

■ What is the intelligent web?
■ Searching
■ Creating suggestions and recommendations
■ Clustering: grouping things together
■ Classification: placing things where they belong
■ Combining classifiers
■ Putting it all together: an intelligent news portal

Appendix
A Introduction to BeanShell
B Web crawling
C Mathematical refresher
D Natural language processing
E Neural networks

contents

What is the intelligent web?
1.1 Examples of intelligent web applications
1.2 Basic elements of intelligent applications
1.3 What applications can benefit from intelligence?
    Social networking sites ■ Mashups ■ Portals ■ Media-sharing sites ■ Online gaming ■ Wikis
1.4 How can I build intelligence in my own application?
    Examine your functionality and your data ■ Get more data from the web
1.5 Machine learning, data mining, and all that
1.6 Eight fallacies of intelligent applications
    Fallacy #1: Your data is reliable ■ Fallacy #2: Inference happens instantaneously ■ Fallacy #3: The size of data doesn't matter ■ Fallacy #4: Scalability of the solution isn't an issue ■ Fallacy #5: Apply the same good library everywhere ■ Fallacy #6: The computation time is known ■ Fallacy #7: Complicated models are better ■ Fallacy #8: There are models without bias
1.7 Summary
1.8 References

Searching
2.1 Searching with Lucene
    Understanding the Lucene code ■ Understanding the basic stages of search
2.2 Why search beyond indexing?
2.3 Improving search results based on link analysis
    An introduction to PageRank ■ Calculating the PageRank vector ■ alpha: The effect of teleportation between web pages ■ Understanding the power method ■ Combining the index scores and the PageRank scores
2.4 Improving search results based on user clicks
    A first look at user clicks ■ Using the NaiveBayes classifier ■ Combining Lucene indexing, PageRank, and user clicks
2.5 Ranking Word, PDF, and other documents without links
    An introduction to DocRank ■ The inner workings of DocRank
2.6 Large-scale implementation issues
2.7 Is what you got what you want? Precision and recall
2.8 Summary
2.9 To Do
2.10 References

Creating suggestions and recommendations
3.1 An online music store: the basic concepts
    The concepts of distance and similarity ■ A closer look at the calculation of similarity ■ Which is the best similarity formula?
3.2 How do recommendation engines work?
    Recommendations based on similar users ■ Recommendations based on similar items ■ Recommendations based on content
3.3 Recommending friends, articles, and news stories
    Introducing MyDiggSpace.com ■ Finding friends ■ The inner workings of DiggDelphi
3.4 Recommending movies on a site such as Netflix.com
    An introduction of movie datasets and recommenders ■ Data normalization and correlation coefficients
3.5 Large-scale implementation and evaluation issues
3.6 Summary
3.7 To Do
3.8 References

Clustering: grouping things together
4.1 The need for clustering
    User groups on a website: a case study ■ Finding groups with a SQL order by clause ■ Finding groups with array sorting
4.2 An overview of clustering algorithms
    Clustering algorithms based on cluster structure ■ Clustering algorithms based on data type and structure ■ Clustering algorithms based on data size
4.3 Link-based algorithms
    The dendrogram: a basic clustering data structure ■ A first look at link-based algorithms ■ The single-link algorithm ■ The average-link algorithm ■ The minimum-spanning-tree algorithm
4.4 The k-means algorithm
    A first look at the k-means algorithm ■ The inner workings of k-means
4.5 Robust Clustering Using Links (ROCK)
    Introducing ROCK ■ Why does ROCK rock?
4.6 DBSCAN
    A first look at density-based algorithms ■ The inner workings of DBSCAN
4.7 Clustering issues in very large datasets
    Computational complexity ■ High dimensionality

Algorithms of the Intelligent Web
Haralambos Marmanis, Dmitry Babenko

An algorithm is a sequence of steps that solves a problem. Algorithms of the Intelligent Web provides exactly that: explicit, clearly organized patterns to implement valuable web application features like recommendation engines, smart searching, content organizers, and much more. With these techniques you'll capture vital raw information about your users and profitably transform it into action.

Algorithms of the Intelligent Web is a handbook for web developers who want to exploit relationships in user data that can't be discovered manually. The book presents crystal-clear explanations of techniques you can apply immediately. It is based on the authors' practical experience as web developers and their deep expertise in the science of machine learning. With a wealth of detailed, Java-based examples, this book shows you how to build applications that behave intelligently and learn from your users' actions.

What's Inside

■ Create recommendations like Netflix or Amazon
■ Implement Google's PageRank algorithm
■ Discover matches on social-networking sites
■ Organize your news group discussions
■ Select topics of interest from shared bookmarks
■ Filter spam and categorize emails based on content

Dr. Haralambos (Babis) Marmanis is a pioneer in the adoption of machine learning techniques for industrial solutions, and also a world expert in supply management.

Dmitry Babenko has designed applications and infrastructure for banking, insurance, supply-chain management, and business intelligence companies.