Appendixes
The appendixes of this book present some crosscutting material that undergirds much of the content that precedes it:
• Appendix A presents a brief overview of the technology that powers the virtual machine experience that accompanies this book, as well as a brief discussion on the scope and purpose of the virtual machine.
• Appendix B provides a short discussion of Open Authorization (OAuth), the industry protocol that undergirds accessing social data from just about any notable social website with an API.
• Appendix C is a very short primer on some common Python idioms that you’ll encounter in the source code for this book; it highlights some subtleties about IPython Notebook that you may benefit from knowing about.
APPENDIX A
Information About This Book’s Virtual Machine Experience
Just as each chapter in this book has a corresponding IPython Notebook, each appendix also has a corresponding IPython Notebook. All notebooks, regardless of purpose, are maintained in the book’s GitHub source code repository. The particular appendix that you are reading here “in print” serves as a special cross-reference to the IPython Notebook that provides step-by-step instructions for how to install and configure the book’s virtual machine.
You are strongly encouraged to install the virtual machine as a development environment instead of using your existing Python installation, because there are some nontrivial configuration management issues involved in installing IPython Notebook and all of its dependencies for scientific computing. The various other third-party Python packages that are used throughout the book and the need to support users across multiple platforms only exacerbate the complexity of getting a basic development environment up and running. Therefore, this book comes with a virtual machine that lets all readers and consumers of the source code follow along interactively with the examples with as little friction as possible. Even if you are an expert in working with Python developer tools, you will still likely save some time by taking advantage of the book’s virtual machine experience on your first pass through the text. Give it a try. You’ll be glad that you did.
The corresponding read-only IPython Notebook, Appendix A: Virtual Machine Experience, is maintained with the book’s GitHub source code repository and contains step-by-step instructions for getting started.
APPENDIX B
OAuth Primer
Just as each chapter in this book has a corresponding IPython Notebook, each appendix also has a corresponding IPython Notebook. All notebooks, regardless of purpose, are maintained in the book’s GitHub source code repository. The particular appendix that you are reading here “in print” serves as a special cross-reference to the IPython Notebook that provides example code demonstrating interactive OAuth flows that involve explicit user authorization, which is needed if you implement a user-facing application.
The remainder of this appendix provides a terse discussion of OAuth as a basic orientation. The sample code for OAuth flows for popular websites such as Twitter, Facebook, and LinkedIn is in the corresponding IPython Notebook that is available with this book’s source code.
Like the other appendixes, this appendix has a corresponding IPython Notebook entitled Appendix B: OAuth Primer that you can view online.
Overview
OAuth stands for “open authorization” and provides a means for users to authorize an application to access their account data through an API without the users needing to hand over sensitive credentials such as a username and password combination. Although OAuth is presented here in the context of the social web, keep in mind that it’s a specification that has wide applicability in any context in which users would like to authorize an application to take certain actions on their behalf. In general, users can control the level of access for a third-party application (subject to the degree of API granularity that the provider implements) and revoke it at any time. For example, consider the case of Facebook, in which extremely fine-grained permissions are implemented and enable users to allow third-party applications to access very specific pieces of sensitive account information.
1. Throughout this discussion, use of the term “OAuth 1.0” is technically intended to mean “OAuth 1.0a,” given that OAuth 1.0 revision A obsoleted OAuth 1.0 and is the widely implemented standard.
Given the nearly ubiquitous popularity of platforms such as Twitter, Facebook, LinkedIn, and Google+, and the vast utility of third-party applications that are developed on these social web platforms, it’s no surprise that they’ve adopted OAuth as a common means of opening up their platforms. However, like any other specification or protocol, OAuth implementations across social web properties currently vary with regard to the version of the specification that’s implemented, and there are sometimes a few idiosyncrasies that come up in particular implementations. The remainder of this section provides a brief overview of OAuth 1.0a, as defined by RFC 5849, and OAuth 2.0, as defined by RFC 6749, that you’ll encounter as you mine the social web and engage in other programming endeavors involving platform APIs.
OAuth 1.0a
OAuth 1.0¹ defines a protocol that enables a web client to access a resource owner’s protected resource on a server and is described in great detail in the OAuth 1.0 Guide.
As you already know, the reason for its existence is to avoid the problem of users (resource owners) sharing passwords with web applications, and although it is fairly narrowly defined in its scope, it does very well the one thing it claims to do. As it turns out, one of the primary developer complaints about OAuth 1.0 that initially hindered adoption was that it was very tedious to implement because of the various cryptographic details involved (such as HMAC signature generation), given that OAuth 1.0 does not assume that credentials are exchanged over a secure SSL connection (HTTPS). In other words, OAuth 1.0 uses cryptographic signatures as part of its flow to protect the integrity of requests sent over the wire.
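To make the HMAC signature generation mentioned above a little more concrete, here is a minimal sketch of how an HMAC-SHA1 signature can be computed along the lines of RFC 5849. The helper names (`percent_encode`, `signature_base_string`, `sign_hmac_sha1`), the example URL, and all credential values are hypothetical placeholders introduced for illustration; a real client also handles the `Authorization` header, nonce and timestamp generation, and query-string parameters, which are omitted here.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode a value so that only unreserved characters stay literal."""
    return quote(str(value), safe="~")


def signature_base_string(method, base_url, params):
    """Build the string that gets signed: METHOD&URL&sorted-parameters."""
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in encoded)
    return "&".join([method.upper(), percent_encode(base_url), percent_encode(normalized)])


def sign_hmac_sha1(base_string, consumer_secret, token_secret=""):
    """Sign the base string; the key joins the two secrets with '&'."""
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical request parameters; every value here is a placeholder.
params = {
    "oauth_consumer_key": "my-consumer-key",
    "oauth_nonce": "abc123",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1300000000",
    "oauth_version": "1.0",
}
base = signature_base_string("POST", "https://api.example.com/request_token", params)
signature = sign_hmac_sha1(base, "my-consumer-secret")
print(signature)
```

Because the signature covers the HTTP method, the URL, and every parameter, a server that holds the same secrets can recompute it and detect a request that was altered in transit. This is the kind of detail that third-party libraries normally take care of for you.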
Although we’ll be fairly informal in this discussion, you might care to know that in OAuth parlance, the application that is requesting access is often known as the client (sometimes called the consumer), the social website or service that houses the protected resources is the server (sometimes called the service provider), and the user who is granting access is the resource owner. Since there are three parties involved in the process, the series of redirects among them is often referred to as a three-legged flow, or more colloquially, the “OAuth dance.” Although the implementation and security details are a bit messy, there are essentially just a few fundamental steps involved in the OAuth dance that ultimately enable a client application to access protected resources on the resource owner’s behalf from the service provider:
1. The client obtains an unauthorized request token from the service provider.
2. The resource owner authorizes the request token.
3. The client exchanges the request token for an access token.
4. The client uses the access token to access protected resources on behalf of the resource owner.
In terms of particular credentials, a client starts with a consumer key and consumer secret and by the end of the OAuth dance winds up with an access token and access token secret that can be used to access protected resources. All things considered, OAuth 1.0 sets out to enable client applications to securely obtain authorization from resource owners to access account resources from service providers, and despite some arguably tedious implementation details, it provides a broadly accepted protocol that makes good on this intention. It is likely that OAuth 1.0 will be around for a while.
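The four steps of the OAuth dance and the credentials involved can be traced with a toy, in-memory model. The following sketch is purely illustrative: `ToyServiceProvider` and its methods are hypothetical names, no network traffic or request signing takes place, and the verifier returned in step 2 stands in for the `oauth_verifier` value that OAuth 1.0a passes back to the client after the user approves access.

```python
import secrets


class ToyServiceProvider:
    """Hypothetical in-memory stand-in for a service provider; not a real API."""

    def __init__(self, consumer_key, consumer_secret):
        self.consumers = {consumer_key: consumer_secret}
        self.request_tokens = {}  # request token -> {"secret": ..., "verifier": ...}
        self.access_tokens = {}   # access token -> access token secret

    def issue_request_token(self, consumer_key):
        # Step 1: hand out an unauthorized request token to a known client.
        if consumer_key not in self.consumers:
            raise PermissionError("unknown consumer")
        token = secrets.token_hex(8)
        self.request_tokens[token] = {"secret": secrets.token_hex(8), "verifier": None}
        return token, self.request_tokens[token]["secret"]

    def authorize(self, request_token):
        # Step 2: the resource owner approves, which yields a verifier.
        verifier = secrets.token_hex(4)
        self.request_tokens[request_token]["verifier"] = verifier
        return verifier

    def exchange(self, request_token, verifier):
        # Step 3: swap an authorized request token for an access token.
        record = self.request_tokens.pop(request_token)
        if record["verifier"] is None or record["verifier"] != verifier:
            raise PermissionError("request token was not authorized")
        access_token, access_secret = secrets.token_hex(8), secrets.token_hex(8)
        self.access_tokens[access_token] = access_secret
        return access_token, access_secret

    def get_resource(self, access_token):
        # Step 4: serve protected data only for a known access token.
        if access_token not in self.access_tokens:
            raise PermissionError("invalid access token")
        return {"status": "ok", "data": "protected resource"}


# The client starts with a consumer key and consumer secret (placeholders here)...
provider = ToyServiceProvider("consumer-key", "consumer-secret")
request_token, request_secret = provider.issue_request_token("consumer-key")
verifier = provider.authorize(request_token)
# ...and ends up with an access token and access token secret.
access_token, access_secret = provider.exchange(request_token, verifier)
print(provider.get_resource(access_token))
```

In a real implementation, every one of these calls is an HTTP request signed with the client's secrets, and step 2 happens in the user's browser on the service provider's own site, so the client never sees the user's password.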
“Introduction to OAuth (in Plain English)” illustrates how an end user (as a resource owner) could authorize a link-shortening service such as bit.ly (as a client) to automatically post links to Twitter (as a service provider). It is worth reviewing and drives home the abstract concepts presented in this section.
OAuth 2.0
Whereas OAuth 1.0 enables a useful, albeit somewhat narrow, authorization flow for web applications, OAuth 2.0 was originally intended to significantly simplify implementation details for web application developers by relying completely on SSL for security aspects, and to satisfy a much broader array of use cases. Such use cases ranged from support for mobile devices to the needs of the enterprise, and even somewhat futuristically considered the needs of the “Internet of Things,” such as devices that might appear in your home.
Facebook was an early adopter, with migration plans dating back to early drafts of OAuth 2.0 in 2011 and a platform that quickly relied exclusively on a portion of the OAuth 2.0 specification, while LinkedIn waited to implement support for OAuth 2.0 until early 2013. Although Twitter’s standard user-based authentication is still based squarely on OAuth 1.0a, it implemented application-based authentication in early 2013 that’s modeled on the Client Credentials Grant flow of the OAuth 2.0 spec. Finally, Google currently implements OAuth 2.0 for services such as Google+, and deprecated support for OAuth 1.0 in April 2012. As you can see, the reaction was somewhat mixed in that not every social website immediately scrambled to implement OAuth 2.0 as soon as it was announced.
Still, it’s a bit unclear whether or not OAuth 2.0 as originally envisioned will ever become the new industry standard. One popular blog post, entitled “OAuth 2.0 and the Road to Hell” (along with its corresponding Hacker News discussion), is worth reviewing and summarizes a lot of the issues. The post was written by Eran Hammer, who resigned his role as lead author and editor of the OAuth 2.0 specification as of mid-2012 after working on it for several years. It appears as though “design by committee” around large open-ended enterprise problems suffocated some of the enthusiasm and progress of the working group, and although the specification was published in late 2012, it is unclear as to whether it provides an actual specification or a blueprint for one. Fortunately, over the previous years, lots of terrific OAuth frameworks have emerged to allay most of the OAuth 1.0 development pains associated with accessing APIs, and developers have continued innovating despite the initial stumbling blocks with OAuth 1.0. As a case in point, in working with Python packages in earlier chapters of this book, you haven’t had to know or care about any of the complex details involved with OAuth 1.0a implementations; you’ve just had to understand the gist of how it works. What does seem clear despite some of the analysis paralysis and “good intentions” associated with OAuth 2.0, however, is that several of its flows seem well-defined enough that large social web providers are moving forward with them.
As you now know, unlike OAuth 1.0 implementations, which consist of a fairly rigid set of steps, OAuth 2.0 implementations can vary somewhat depending on the particular use case. A typical OAuth 2.0 flow, however, does take advantage of SSL and essentially just consists of a few redirects that, at a high enough level, don’t look all that different from the previously mentioned set of steps involving an OAuth 1.0 flow. For example, Twitter’s recent application-only authentication involves little more than an application exchanging its consumer key and consumer secret for an access token over a secure SSL connection. Again, implementations will vary based on the particular use case, and although it’s not exactly light reading, Section 4 of the OAuth 2.0 spec is fairly digestible content if you’re interested in some of the details. If you choose to review it, just keep in mind that some of the terminology differs between OAuth 1.0 and OAuth 2.0, so it may be easier to focus on understanding one specification at a time as opposed to learning them both simultaneously.
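As a rough illustration of how little is involved in an application-only exchange of this kind, the sketch below constructs (but does not send) the token request. It assumes the endpoint and request format that Twitter documented for application-only authentication at the time: the consumer key and secret are joined with a colon, Base64-encoded into an HTTP Basic `Authorization` header, and posted with `grant_type=client_credentials`. The function name `build_token_request` and the credential values are hypothetical, and the endpoint may have changed since.

```python
import base64
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: token endpoint as documented by Twitter for application-only auth.
TOKEN_URL = "https://api.twitter.com/oauth2/token"


def build_token_request(consumer_key, consumer_secret, token_url=TOKEN_URL):
    """Build the HTTPS request that trades app credentials for an access token."""
    # URL-encode each credential, join with a colon, then Base64-encode the pair.
    credentials = f"{quote(consumer_key, safe='')}:{quote(consumer_secret, safe='')}"
    basic = base64.b64encode(credentials.encode("ascii")).decode("ascii")
    body = urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return Request(
        token_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
        },
    )


# Placeholder credentials; nothing is sent over the network here.
request = build_token_request("my-consumer-key", "my-consumer-secret")
print(request.get_method(), request.full_url)
print(request.data)
```

Notice that nothing is signed: unlike the OAuth 1.0a flow, the secrets travel in the request itself, which is why this style of flow depends entirely on the SSL connection for protection.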
Chapter 9 of Jonathan LeBlanc’s Programming Social Applications (O’Reilly) provides a nice discussion of OAuth 1.0 and OAuth 2.0 in the context of building social web applications.
The idiosyncrasies of OAuth and the underlying implementations of OAuth 1.0 and OAuth 2.0 are generally not going to be all that important to you as a social web miner.
This discussion was tailored to provide some surrounding context so that you have a basic understanding of the key concepts involved and to provide some starting points for further study and research should you like to do so. As you may have already gathered, the devil really is in the details. Fortunately, nice third-party libraries largely obviate the need to know much about those details on a day-to-day basis, although that knowledge can sometimes come in handy. The online code for this appendix features both OAuth 1.0 and OAuth 2.0 flows, and you can dig into as much detail with them as you’d like.
APPENDIX C
Python and IPython Notebook Tips & Tricks
Just as each chapter in this book has a corresponding IPython Notebook, each appendix also has a corresponding IPython Notebook. Like Appendix A, this “in print” appendix serves as a special cross-reference to an IPython Notebook that’s maintained in the book’s GitHub source code repository and includes a collection of Python idioms as well as some helpful tips for using IPython Notebook.
The corresponding IPython Notebook for this appendix, Appendix C: Python and IPython Notebook Tips & Tricks, contains additional examples of common Python idioms that you may find of particular relevance as you work through this book. It also contains some helpful tips about working with IPython Notebook that may save you some time.
Even though it’s not that uncommon to hear Python referred to as “executable pseudocode,” a brief review of Python as a general-purpose programming language may be worthwhile for readers new to Python. Please consider following along with Sections 1 through 8 of the Python Tutorial as a means of basic familiarization if you feel that you could benefit from a general-purpose introduction to Python as a programming language. It’s a worthwhile investment and will maximize your enjoyment of this book.
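As a small taste of the kind of idioms in question, the following sketch shows three that appear in this book's index: a double list comprehension, the `collections.Counter` class, and arbitrary arguments via `*args` and `**kwargs`. The selection and the sample data are assumptions made for illustration and are not taken from the notebook itself.

```python
from collections import Counter

# Hypothetical sample data shaped loosely like a list of statuses.
statuses = [
    {"text": "Mining the social web", "hashtags": ["python", "data"]},
    {"text": "OAuth is a dance", "hashtags": ["oauth", "python"]},
]

# Double list comprehension: flatten the hashtags of every status into one list.
hashtags = [tag for status in statuses for tag in status["hashtags"]]

# Counter: tally the items and ask for the most frequent one.
top = Counter(hashtags).most_common(1)


def describe(*args, **kwargs):
    """Accept arbitrary positional and keyword arguments."""
    return len(args), sorted(kwargs)


print(hashtags)
print(top)
print(describe(1, 2, count=5, q="python"))
```

The double comprehension reads left to right in the same order as the equivalent nested `for` loops, which is the easiest way to keep the clause order straight.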
We’d like to hear your suggestions for improving our indexes. Send email to index@oreilly.com.
Index
Symbols
$ (MongoDB operator), 248
$** (MongoDB operator), 260
68-95-99.7 rule, 174
A
access token (OAuth) about, 405 Facebook, 48 GitHub, 282–284 Twitter, 13, 354–357
access token secret (OAuth), 13, 354–357, 405 activities (Google+), 137, 142–147
agglomeration clustering technique, 121
aggregation framework (MongoDB), 255–259, 363
analyzing GitHub API
about, 292
extending interest graphs, 299–310 graph centrality measures, 296–299 nodes as query pivots, 311–315 seeding interest graphs, 292–296 visualizing interest graphs, 316–318 analyzing Google+ data
bigrams in human language, 167–177 TF-IDF, 147–155
analyzing LinkedIn data
clustering data, 97–100, 115–130
measuring similarity, 98, 112–114 normalizing data, 101–112 analyzing mailboxes
analyzing Enron corpus, 246–263 analyzing mail data, 268–274
analyzing sender/recipient patterns, 250–255 analyzing Social Graph connections
about, 59–63
analyzing Facebook pages, 63–70 analyzing likes, 70–78
analyzing mutual friendships, 78–85 examining friendships, 70–85 analyzing Twitter platform objects
about, 26–27
analyzing favorite tweets, 394
extracting tweet entities, 28, 368, 371, 381 frequency analysis, 29–32, 36–41, 373 lexical diversity of tweets, 32–34, 390 patterns in retweets, 34–36, 374–376 analyzing web pages
by scraping, parsing, and crawling, 183–190 entity-centric, 209–218
quality of analytics, 219–222
semantic understanding of data, 190–209 API key (OAuth), 91, 138
API requests Facebook, 46–59 GitHub, 281–287 Google+, 136–147
LinkedIn, 90–96 Twitter, 12–15
approximate matching (see clustering LinkedIn data)
arbitrary arguments, 20
*args (Python), 20 Aristotle, 342 Atom feed, 184 authorizing applications
accessing Gmail, 269–271 Facebook, 48
GitHub API, 286–287 Google+ API, 138–147 LinkedIn API, 91–96 Twitter and, 13–15, 353–357 avatars, 141
B
B-trees, 264
bag of words model, 190 Bayesian classifier, 223
BeautifulSoup Python package, 144, 185 betweenness graph metric, 297 big data
about, 189
big graph databases, 291 map-reduce and, 246 Big O notation, 99, 264
BigramAssociationMeasures Python class, 114 BigramCollocationFinder function, 172 bigrams, 113, 167–177
Bing geocoding service, 109 binomial distribution, 176 bipartite analysis, 315 boilerplate detection, 183–184 bookmarking projects, 282 bot policy, 326
bounded breadth-first searches, 187 breadth-first searches, 186–190 Brown Corpus, 157
C
Cantor, George, 19 cartograms, 109–112 central limit theorem, 175 centrality measures
application of, 303–306 betweenness, 297
closeness, 297
computing for graphs, 296–299 degree, 296
online resources, 320 centroid (clusters), 125 chi-square test, 176 chunking (NLP), 194 circles (Google+), 137 cleanHTML function, 144 clique detection
Facebook, 78–85
NetworkX Python package, 312 closeness graph metric, 297 cluster Python package, 120, 127 clustering LinkedIn data
about, 97–100
clustering algorithms, 115–130 dimensionality reduction and, 98 greedy clustering, 115–120 hierarchical clustering, 120–124 k-means clustering, 124–125 measuring similarity, 98, 112–114 normalizing data to enable analysis, 101 online resources, 133
recommended exercises, 133
visualizing with Google Earth, 127–130 clustering posts with cosine similarity, 163–166 collections Python module
about, 30
Counter class, 30, 69, 72, 114, 287, 372 collective intelligence, 9
collocations
computing, 167–171 n-gram similarity, 113, 167 comments (Google+), 137, 142 Common Crawl Corpus, 186, 323 company names (LinkedIn data), 101–103 confidence intervals, 219
Connections API (LinkedIn), 93 consumer key (OAuth), 13, 354–357, 405 consumer secret (OAuth), 13, 354–357, 405 content field (Google+), 144
context, human language data and, 177 contingency tables, 169–177
converting
mail corpus to Unix mailbox, 235–236 mailboxes to JSON, 236–240
cosine similarity about, 160–163
clustering posts with, 163–166 visualizing with matrix diagram, 166 CouchDB, 246
Counter class
Facebook and, 69, 72 GitHub and, 287 LinkedIn and, 114 Twitter and, 30, 372 CSS query selectors, 335 CSV file format, 96 csv Python module, 96, 373 cursors (Twitter API), 359 CVS version control system, 279
D
D3.js toolkit, 83, 109, 166, 316 Data Science Toolkit, 132 DataSift platform, 382
date/time range, query by, 247–250 datetime function, 250
dateutil Python package, 235 DBPedia initiative, 347
deduplication (see clustering LinkedIn data) degree graph metric, 296
degree of nodes in graphs, 290 dendrograms, 122–124 density of graphs, 290 depth-first searches, 186 dereferencing, 102 Dice’s coefficient, 175
digraphs (directed graphs), 78–85, 288–291 dimensionality reduction, 98
dir Python function, 287
directed graphs (digraphs), 78–85, 288–291 distributed version control systems, 279 document summarization, 200–209
document-oriented databases (see MongoDB) dollar sign ($-MongoDB operator), 248 Dorling Cartogram, 109–112
double list comprehension, 28 dynamic programming, 121
E
edit distance, 113
ego (social networks), 49, 75–78, 293 ego graphs, 49, 293–296
email Python package, 230, 235
end-of-sentence (EOS) detection, 192, 193, 196–200
Enron corpus
about, 226, 246
advanced queries, 255–259
analyzing sender/recipient patterns, 250–255 getting Enron data, 232–234
online resources, 276
query by date/time range, 247–250 entities
interactions between, 215–218 property graphs representing, 288–291 entities field (tweets), 26, 368
entity extraction, 195, 211
entity resolution (entity disambiguation), 67 entity-centric analysis, 209–218
envoy Python package, 241
EOS (end-of-sentence) detection, 192, 193, 196–200
extracting tweet entities, 28, 368, 371, 381
extraction (NLP), 195, 211
F
F1 score, 219 Facebook, 46
(see also Social Graph API) about, 45–47
analyzing connections, 59–85 interest graphs and, 45, 292 online resources, 86 recommended exercises, 85 Facebook accounts, 46, 47 Facebook pages, analyzing, 63–70 Facebook Platform Policies document, 47 facebook Python package, 54, 71 Facebook Query Language (FQL), 47, 53 false negatives, 220
false positives, 220
favorite_count field (tweets), 27, 370 feedparser Python package, 184, 196
field expansion feature (Social Graph API), 53 fields
Facebook Social Graph API, 49 Google+ API, 144
LinkedIn API, 96 MongoDB, 260 Twitter API, 26–27, 359
find function (Python), 171, 244, 255 Firefox Operator add-on, 330
folksonomies, 9 following model
GitHub, 299–310 interest graphs and, 292 Twitter, 5, 7, 10, 46, 382–385, 388 forked projects, 281
forward chaining, 342
FQL (Facebook Query Language), 47, 53 frequency analysis
document summarization, 200–209 Facebook data, 63–85
LinkedIn data, 101–109 TF-IDF, 147–155
Twitter data, 29–32, 36–41, 373 Zipf’s law, 157–157
friendship graphs, 388 friendship model
Facebook, 46, 49, 70–85 Twitter, 8, 382–385, 388 Friendster social network, 319 functools.partial function, 361, 378 FuXi reasoning system, 342
fuzzy matching (see clustering LinkedIn data)
G
geo microformat, 323, 326–330 geocoding service (Bing), 109 geocoordinates, 323, 325–330 GeoJSON, 132
geopy Python package, 107 Gephi open source project, 316 GET search/tweets resource, 20–22 GET statuses/retweets resource, 36 GET trends/place resource, 17 Git version control system, 279, 280 GitHub
about, 279
following model, 299–310 online resources, 320 recommended exercises, 318 social coding, 279
GitHub API about, 281
analyzing interest graphs, 292–318 creating connections, 282–286 making requests, 286–287
modeling data with property graphs, 288–291
online resources, 320
recommended exercises, 319 terminology, 281
gitscm.com, 280 Gmail
accessing with OAuth, 269–271 visualizing patterns in, 273–274 GNU Prolog, 341
Google API Console, 138 Google Earth, 127–130, 329 Google Knowledge Graph, 190 Google Maps, 127, 327
Google Structured Data Testing Tool, 336–338 Google+ accounts, 136
Google+ API about, 136–138
making requests, 138–147 online resources, 180
querying human language data, 155–178 recommended exercises, 179
terminology, 137 TF-IDF and, 147–155
google-api-python-client package, 140
Graph API (Facebook) (see Social Graph API (Facebook))
Graph API Explorer app, 47, 48–54 Graph Search project (Facebook), 56
Graph Your Inbox Chrome extension, 273–274 GraphAPI class (facebook Python package)
get_connections() method, 59 get_object() method, 59, 64, 71 get_objects() method, 59 request() method, 59 Graphviz, 316
greedy clustering, 115–120
H
hangouts (Google+), 137 hashtags (tweets)
about, 9, 21 extracting, 28
frequency data in histograms, 38–40 lexical diversity of, 34
hCalendar microformat, 323, 336 hCard microformat, 323, 336
help Python function, 12, 140, 155, 287 hierarchical clustering, 120–124 HierarchicalClustering Python class, 122 histograms
frequency data for tweets, 36–41