Table of Contents

Preface

1. Introduction: Hacking on Twitter Data
    Installing Python Development Tools
    Collecting and Manipulating Twitter Data
    Tinkering with Twitter’s API
    Frequency Analysis and Lexical Diversity
    Visualizing Tweet Graphs
    Synthesis: Visualizing Retweets with Protovis
    Closing Remarks

2. Microformats: Semantic Markup and Common Sense Collide
    XFN and Friends
    Exploring Social Connections with XFN
    A Breadth-First Crawl of XFN Data
    Geocoordinates: A Common Thread for Just About Anything
    Wikipedia Articles + Google Maps = Road Trip?
    Slicing and Dicing Recipes (for the Health of It)
    Collecting Restaurant Reviews
    Summary

3. Mailboxes: Oldies but Goodies
    mbox: The Quick and Dirty on Unix Mailboxes
    mbox + CouchDB = Relaxed Email Analysis
    Bulk Loading Documents into CouchDB
    Sensible Sorting
    Map/Reduce-Inspired Frequency Analysis
    Sorting Documents by Value
    couchdb-lucene: Full-Text Indexing and More
    Threading Together Conversations
    Look Who’s Talking
    Visualizing Mail “Events” with SIMILE Timeline
    Analyzing Your Own Mail Data
    The Graph Your (Gmail) Inbox Chrome Extension
    Closing Remarks

4. Twitter: Friends, Followers, and Setwise Operations
    RESTful and OAuth-Cladded APIs
    No, You Can’t Have My Password
    A Lean, Mean Data-Collecting Machine
    A Very Brief Refactor Interlude
    Redis: A Data Structures Server
    Elementary Set Operations
    Souping Up the Machine with Basic Friend/Follower Metrics
    Calculating Similarity by Computing Common Friends and Followers
    Measuring Influence
    Constructing Friendship Graphs
    Clique Detection and Analysis
    The Infochimps “Strong Links” API
    Interactive 3D Graph Visualization
    Summary

5. Twitter: The Tweet, the Whole Tweet, and Nothing but the Tweet
    Pen : Sword :: Tweet : Machine Gun (?!?)
    Analyzing Tweets (One Entity at a Time)
    Tapping (Tim’s) Tweets
    Who Does Tim Retweet Most Often?
    What’s Tim’s Influence?
    How Many of Tim’s Tweets Contain Hashtags?
    Juxtaposing Latent Social Networks (or #JustinBieber Versus #TeaParty)
    What Entities Co-Occur Most Often with #JustinBieber and #TeaParty Tweets?
    On Average, Do #JustinBieber or #TeaParty Tweets Have More Hashtags?
    Which Gets Retweeted More Often: #JustinBieber or #TeaParty?
    How Much Overlap Exists Between the Entities of #TeaParty and #JustinBieber Tweets?
    Visualizing Tons of Tweets
    Visualizing Tweets with Tricked-Out Tag Clouds
    Visualizing Community Structures in Twitter Search Results
    Closing Remarks

6. LinkedIn: Clustering Your Professional Network for Fun (and Profit?)
    Motivation for Clustering
    Clustering Contacts by Job Title
    Standardizing and Counting Job Titles
    Common Similarity Metrics for Clustering
    A Greedy Approach to Clustering
    Hierarchical and k-Means Clustering
    Fetching Extended Profile Information
    Geographically Clustering Your Network
    Mapping Your Professional Network with Google Earth
    Mapping Your Professional Network with Dorling Cartograms
    Closing Remarks

7. Google Buzz: TF-IDF, Cosine Similarity, and Collocations
    Buzz = Twitter + Blogs (???)
    Data Hacking with NLTK
    Text Mining Fundamentals
    A Whiz-Bang Introduction to TF-IDF
    Querying Buzz Data with TF-IDF
    Finding Similar Documents
    The Theory Behind Vector Space Models and Cosine Similarity
    Clustering Posts with Cosine Similarity
    Visualizing Similarity with Graph Visualizations
    Buzzing on Bigrams
    How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions
    Tapping into Your Gmail
    Accessing Gmail with OAuth
    Fetching and Parsing Email Messages
    Before You Go Off and Try to Build a Search Engine…
    Closing Remarks

8. Blogs et al.: Natural Language Processing (and Beyond)
    NLP: A Pareto-Like Introduction
    Syntax and Semantics
    A Brief Thought Exercise
    A Typical NLP Pipeline with NLTK
    Sentence Detection in Blogs with NLTK
    Summarizing Documents
    Analysis of Luhn’s Summarization Algorithm
    Entity-Centric Analysis: A Deeper Understanding of the Data
    Quality of Analytics
    Closing Remarks

9. Facebook: The All-in-One Wonder
    Tapping into Your Social Network Data
    From Zero to Access Token in Under 10 Minutes
    Facebook’s Query APIs
    Visualizing Facebook Data
    Visualizing Your Entire Social Network
    Visualizing Mutual Friendships Within Groups
    Where Have My Friends All Gone? (A Data-Driven Game)
    Visualizing Wall Data As a (Rotating) Tag Cloud
    Closing Remarks

10. The Semantic Web: A Cocktail Discussion
    An Evolutionary Revolution?
    Man Cannot Live on Facts Alone
    Open-World Versus Closed-World Assumptions
    Inferencing About an Open World with FuXi
    Hope

Index

Preface

The Web is more a social creation than a technical one. I designed it for a social effect, to help people work together, and not as a technical toy. The ultimate goal of the Web is to support and improve our weblike existence in the world. We clump into families, associations, and companies. We develop trust across the miles and distrust around the corner.
-- Tim Berners-Lee, Weaving the Web (Harper)

To Read This Book?

If you have a basic programming background and are interested in insight surrounding the opportunities that arise from mining and analyzing data from the social web, you’ve come to the right place. We’ll begin getting our hands dirty after just a few more pages of frontmatter. I’ll be forthright, however, and say upfront that one of the chief complaints you’re likely to have about this book is that all of the chapters are far too short. Unfortunately, that’s always the case when trying to capture a space that’s evolving daily and is so rich and abundant with opportunities. That said, I’m a fan of the “80-20 rule”, and I sincerely believe that this book is a reasonable attempt at presenting the most interesting 20 percent of the space that you’d want to explore with 80 percent of your available time.

This book is short, but it does cover a lot of ground. Generally speaking, there’s a little more breadth than depth, although where the situation lends itself and the subject matter is complex enough to warrant a more detailed discussion, there are a few deep dives into interesting mining and analysis techniques. The book was written so that you could have the option of either reading it from cover to cover to get a broad primer on working with social web data, or pick and choose chapters that are of particular interest to you. In other words, each chapter is designed to be bite-sized and
fairly standalone, but special care was taken to introduce material in a particular order so that the book as a whole is an enjoyable read.

Social networking websites such as Facebook, Twitter, and LinkedIn have transitioned from fad to mainstream to global phenomena over the last few years. In the first quarter of 2010, the popular social networking site Facebook surpassed Google for the most page visits,* confirming a definite shift in how people are spending their time online. Asserting that this event indicates that the Web has now become more a social milieu than a tool for research and information might be somewhat indefensible; however, this data point undeniably indicates that social networking websites are satisfying some very basic human desires on a massive scale in ways that search engines were never designed to fulfill. Social networks really are changing the way we live our lives on and off the Web,† and they are enabling technology to bring out the best (and sometimes the worst) in us. The explosion of social networks is just one of the ways that the gap between the real world and cyberspace is continuing to narrow.

Generally speaking, each chapter of this book interlaces slivers of the social web along with data mining, analysis, and visualization techniques to answer the following kinds of questions:

• Who knows whom, and what friends they have in common?
• How frequently are certain people communicating with one another?
• How symmetrical is the communication between people?
• Who are the quietest/chattiest people in a network?
• Who are the most influential/popular people in a network?
• What are people chatting about (and is it interesting)?
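To make a few of these questions concrete, here is a minimal Python sketch that answers three of them against a toy network. Everything in it is an assumption made for illustration: the names, the friends and messages data, and the helper functions are hypothetical and are not taken from the book's example code.

```python
from collections import Counter

# Hypothetical toy network: each person maps to the set of people they know.
friends = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"alice", "carol"},
}

# Hypothetical message log as (sender, recipient) pairs.
messages = [
    ("alice", "bob"), ("alice", "bob"), ("bob", "alice"),
    ("carol", "alice"), ("alice", "carol"), ("alice", "dave"),
]


def common_friends(a, b):
    """Who knows whom: the friends that a and b share (set intersection)."""
    return friends[a] & friends[b]


def messages_sent():
    """Chattiness: how many messages each person has sent."""
    return Counter(sender for sender, _ in messages)


def symmetry(a, b):
    """Symmetry: smaller over larger message count between a and b.

    1.0 is perfectly balanced, 0.0 is entirely one-sided, and None
    means the two people never exchanged a message.
    """
    pair_counts = Counter(messages)
    a_to_b, b_to_a = pair_counts[(a, b)], pair_counts[(b, a)]
    if a_to_b + b_to_a == 0:
        return None
    return min(a_to_b, b_to_a) / max(a_to_b, b_to_a)


if __name__ == "__main__":
    print(sorted(common_friends("alice", "carol")))
    print(messages_sent().most_common(1))
    print(symmetry("alice", "bob"))
```

Sets and counters are enough for a network this small. With real data the same questions would be asked of whatever a social networking API returns, and the structures would likely need to live in a data store rather than in memory.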
The answers to these types of questions generally connect two or more people together and point back to a context indicating why the connection exists. The work involved in answering these kinds of questions is only the beginning of more complex analytic processes, but you have to start somewhere, and the low-hanging fruit is surprisingly easy to grasp, thanks to well-engineered social networking APIs and open source toolkits.

Loosely speaking, this book treats the social web‡ as a graph of people, activities, events, concepts, etc. Industry leaders such as Google and Facebook have begun to increasingly push graph-centric terminology rather than web-centric terminology as they simultaneously promote graph-based APIs. In fact, Tim Berners-Lee has suggested that perhaps he should have used the term Giant Global Graph (GGG) instead of World Wide Web (WWW), because the terms “web” and “graph” can be so freely interchanged in the context of defining a topology for the Internet. Whether the fullness of Tim Berners-Lee’s original vision will ever be realized remains to be seen, but the Web as we know it is getting richer and richer with social data all the time. When we look back years from now, it may well seem obvious that the second- and third-level effects created by an inherently social web were necessary enablers for the realization of a truly semantic web. The gap between the two seems to be closing.

* See the opening paragraph of Chapter
† Mark Zuckerberg, the creator of Facebook, was named Person of the Year for 2010 by Time magazine (http://www.time.com/time/specials/packages/article/0,28804,2036683_2037183_2037185,00.html).
‡ See http://journal.planetwork.net/article.php?lab=reed0704 for another perspective on the social web that focuses on digital identities.

Or Not to Read This Book?
Activities such as building your own natural language processor from scratch, venturing far beyond the typical usage of visualization libraries, and constructing just about anything state-of-the-art are not within the scope of this book. You’ll be really disappointed if you purchase this book because you want to do one of those things. However, just because it’s not realistic or our goal to capture the holy grail of text analytics or record matching in a mere few hundred pages doesn’t mean that this book won’t enable you to attain reasonable solutions to hard problems, apply those solutions to the social web as a domain, and have a lot of fun in the process. It also doesn’t mean that taking a very active interest in these fascinating research areas wouldn’t potentially be a great idea for you to consider. A short book like this one can’t do much beyond whetting your appetite and giving you enough insight to go out and start making a difference somewhere with your newly found passion for data hacking.

Maybe it’s obvious in this day and age, but another important item of note is that this book generally assumes that you’re connected to the Internet. This wouldn’t be a great book to take on vacation with you to a remote location, because it contains many references that have been hyperlinked, and all of the code examples are hyperlinked directly to GitHub, a very social Git repository that will always reflect the most up-to-date example code available. The hope is that social coding will enhance collaboration between like-minded folks such as ourselves who want to work together to extend the examples and hack away at interesting problems. Hopefully, you’ll fork, extend, and improve the source, and maybe even make some new friends along the way. Readily accessible sources of online information such as API docs are also liberally hyperlinked, and it is assumed that you’d rather look them up online than rely on inevitably stale copies in this printed book. The official GitHub repository
that maintains the latest and greatest bug-fixed source code for this book is http://github.com/ptwobrussell/Mining-the-Social-Web. The official Twitter account for this book is @SocialWebMining.

This book is also not recommended if you need a reference that gets you up to speed on distributed computing platforms such as sharded MySQL clusters or NoSQL technologies such as Hadoop or Cassandra. We use some less-than-conventional storage technologies such as CouchDB and Redis, but always within the context of running on ...

... many of the most popular social networking sites have licensing terms that prohibit the use of their data outside of their platforms, but at the moment, it’s par for the course. Most social networking... gardens, but from their standpoint (and the standpoint of their investors) a lot of the value these companies offer currently relies on controlling the platforms and protecting the privacy of their