Analyzing Your Own Mail Data

Part of the document Mining the Social Web, 2nd edition (pages 294-300)

Part I. A Guided Tour of the Social Web Prelude

6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More

6.5. Analyzing Your Own Mail Data

The Enron mail data makes for great illustrations in a chapter on mail analysis, but you’ll probably also want to take a closer look at your own mail data. Fortunately, many popular mail clients provide an “export to mbox” option, which makes it pretty simple to get your mail data into a format that lends itself to analysis by the techniques described in this chapter.

For example, in Apple Mail, you can select some number of messages, pick “Save As…” from the File menu, and then choose “Raw Message Source” as the formatting option to export the messages as an mbox file (see Figure 6-2). A little bit of searching should turn up results for how to do this in most other major clients.

Figure 6-2. Most mail clients provide an option for exporting your mail data to an mbox archive

If you exclusively use an online mail client, you could opt to pull your data down into a mail client and export it, but you might prefer to fully automate the creation of an mbox file by pulling the data directly from the server. Just about any online mail service will support POP3 (Post Office Protocol, version 3), most also support IMAP (Internet Message Access Protocol), and it’s not hard to whip up Python scripts for pulling down your mail.
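As a rough illustration of such a script, the sketch below pulls every message from an IMAP folder and appends it to a local mbox file using only the standard library. It is written for Python 3 rather than the Python 2 used in this chapter's listings, the function names fetch_raw_messages and write_mbox are made up for this example, and it assumes you already hold an authenticated imaplib connection.

```python
import imaplib
import mailbox


def fetch_raw_messages(conn, folder="INBOX"):
    """Yield each message in `folder` as raw RFC 822 bytes.

    `conn` is assumed to be an already authenticated imaplib.IMAP4
    (or IMAP4_SSL) connection.
    """
    conn.select(folder, readonly=True)  # read-only: do not mark mail as seen
    status, data = conn.search(None, "ALL")
    if status != "OK":
        return
    for msg_id in data[0].split():
        status, parts = conn.fetch(msg_id, "(RFC822)")
        if status == "OK" and parts and isinstance(parts[0], tuple):
            yield parts[0][1]


def write_mbox(raw_messages, path):
    """Append raw messages to the mbox at `path`; return how many were added."""
    box = mailbox.mbox(path)  # created if it does not exist yet
    count = 0
    try:
        for raw in raw_messages:
            box.add(raw)  # mailbox supplies the "From " separator line
            count += 1
        box.flush()
    finally:
        box.close()
    return count


# Hypothetical usage (host name and credentials are placeholders):
# conn = imaplib.IMAP4_SSL("imap.example.com")
# conn.login("user", "password")
# write_mbox(fetch_raw_messages(conn), "/tmp/mail.mbox")
```

The sketch does no file locking, so it is only suitable when nothing else is writing to the same mbox at the same time.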

One particularly robust command-line tool that you can use to pull mail data from just about anywhere is getmail, which turns out to be written in Python. Two modules included in Python’s standard library, poplib and imaplib, provide a terrific foundation, so you’re also likely to run across lots of useful scripts if you do a bit of searching online. getmail is particularly easy to get up and running. To retrieve your Gmail inbox data, for example, you just download and install it, then set up a getmailrc configuration file.

The following sample settings are for a *nix environment. Windows users would need to change the [destination] path and [options] message_log values to valid paths, but keep in mind that you could opt to run the script on the virtual machine for this book if you needed a quick fix for a *nix environment:

[retriever]
type = SimpleIMAPSSLRetriever
server = imap.gmail.com
username = ptwobrussell
password = xxx

[destination]
type = Mboxrd
path = /tmp/gmail.mbox

[options]
verbose = 2
message_log = ~/.getmail/gmail.log

With a configuration in place, simply invoking getmail from a terminal does the rest.

Once you have a local mbox on hand, you can analyze it using the techniques you’ve learned in this chapter. Here’s what getmail looks like while it’s in action slurping down your mail data:

$ getmail
getmail version 4.20.0
Copyright (C) 1998-2009 Charles Cazabon. Licensed under the GNU GPL version 2.
SimpleIMAPSSLRetriever:ptwobrussell@imap.gmail.com:993:
msg 1/10972 (4227 bytes) from ... delivered to Mboxrd /tmp/gmail.mbox
msg 2/10972 (3219 bytes) from ... delivered to Mboxrd /tmp/gmail.mbox
...
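With a local mbox on hand, a first pass at analysis can be quite small. The following sketch (Python 3, standard library only) tallies how many messages each sender contributed to an mbox file. The function name count_senders is an assumption made for this example, and the path in the usage comment simply mirrors the getmail configuration above.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(mbox_path):
    """Return a Counter mapping lowercased sender addresses to message counts."""
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            # parseaddr splits 'Name <addr>' into (name, addr); addr is '' if absent
            _, addr = parseaddr(str(msg.get("From", "")))
            if addr:
                counts[addr.lower()] += 1
    finally:
        box.close()
    return counts


# Hypothetical usage:
# for address, n in count_senders("/tmp/gmail.mbox").most_common(10):
#     print(address, n)
```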

6.5.1. Accessing Your Gmail with OAuth

In early 2010, Google announced OAuth access to IMAP and SMTP in Gmail. This was a significant announcement because it officially opened the door to “Gmail as a platform,” enabling third-party developers to build apps that can access your Gmail data without you needing to give them your username and password. This section won’t get into the particular nuances of how Xoauth, Google’s particular implementation of OAuth, works (see Appendix B for a terse introduction to OAuth in general); instead, it focuses on getting you up and running so that you can access your Gmail data, which involves just a few simple steps:

1. Select the “Enable IMAP” option under the “Forwarding and POP/IMAP” tab in your Gmail Account Settings.

2. Visit the Google Mail Xoauth Tools wiki page, download the xoauth.py command-line utility, and follow the instructions to generate an OAuth token and secret for an “anonymous” consumer.[6]

3. Install python-oauth2 via pip install oauth2 and use the template in Example 6-17 to establish a connection.

[6] If you’re just hacking your own Gmail data, using the anonymous consumer credentials generated from xoauth.py is just fine; you can always register and create a “trusted” client application at a later time should you need to do so.

Example 6-17. Connecting to Gmail with Xoauth

import sys
import oauth2 as oauth
import oauth2.clients.imap as imaplib

# See http://code.google.com/p/google-mail-xoauth-tools/wiki/
# XoauthDotPyRunThrough for details on obtaining and
# running xoauth.py to get the credentials

OAUTH_TOKEN = '' # XXX: Obtained with xoauth.py
OAUTH_TOKEN_SECRET = '' # XXX: Obtained with xoauth.py

GMAIL_ACCOUNT = '' # XXX: Your Gmail address - example@gmail.com

url = 'https://mail.google.com/mail/b/%s/imap/' % (GMAIL_ACCOUNT, )

# Standard values for Gmail's Xoauth

consumer = oauth.Consumer('anonymous', 'anonymous')
token = oauth.Token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

conn = imaplib.IMAP4_SSL('imap.googlemail.com')
conn.debug = 4 # Set to the desired debug level
conn.authenticate(url, consumer, token)

conn.select('INBOX')

# Access your INBOX data

Once you’re able to programmatically access your mailbox, the next step is to fetch and parse some message data. The great thing about this is that we’ll format and export it to exactly the same specification that we’ve been working with so far in this chapter, so all of your scripts and tools will work on both the Enron corpus and your own mail data!

6.5.2. Fetching and Parsing Email Messages with IMAP

The IMAP protocol is a fairly finicky and complex beast, but the good news is that you don’t have to know much of it to search and fetch mail messages. Furthermore, imaplib-compliant examples are readily available online.

One of the more common operations you’ll want to do is search for messages. There are various ways that you can construct an IMAP query. An example of how you’d search for messages from a particular user is conn.search(None, '(FROM "me")'), where None is an optional parameter for the character set and '(FROM "me")' is a search command to find messages that you’ve sent yourself (Gmail recognizes “me” as the authenticated user). A command to search for messages containing “foo” in the subject would be '(SUBJECT "foo")', and there are many additional possibilities that you can read about in Section 6.4.4 of RFC 3501, which defines the IMAP specification. imaplib returns a search response as a tuple that consists of a status code and a string of space-separated message IDs wrapped in a list, such as ('OK', ['506 527 566']). You can parse out these ID values to fetch RFC 822-compliant mail messages, but alas, there’s additional work involved to parse the content of the mail messages into a usable form.
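The shape of the search command and its response can be made concrete with a small sketch. It is written for Python 3, where imaplib hands back the ID string as bytes rather than str, and the helper names subject_query and parse_search_response are invented for this illustration.

```python
def subject_query(term):
    """Build an IMAP SEARCH criterion such as '(SUBJECT "foo")'."""
    # Backslashes and double quotes must be escaped inside an IMAP quoted string
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '(SUBJECT "%s")' % escaped


def parse_search_response(response):
    """Turn ('OK', ['506 527 566']) into ['506', '527', '566'].

    Accepts the ID string as either str or bytes; returns [] when the
    status is not 'OK' or when nothing matched.
    """
    status, data = response
    if status != "OK" or not data or data[0] is None:
        return []
    ids = data[0]
    if isinstance(ids, bytes):
        ids = ids.decode("ascii")
    return ids.split()


# Hypothetical usage with an authenticated connection `conn`:
# ids = parse_search_response(conn.search(None, subject_query("Alaska")))
```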

Fortunately, with some minimal adaptation we can reuse the code from Example 6-3, which used the email module to parse messages into a more readily usable form, to take care of the uninteresting email-parsing cruft that’s necessary to get usable text from each message. Example 6-18 illustrates this.

Example 6-18. Query your Gmail inbox and store the results as JSON

import os
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

# Note: conn and GMAIL_ACCOUNT are the values defined in Example 6-17.

# What you'd like to search for in the subject of your mail.
# See Section 6.4.4 of http://www.faqs.org/rfcs/rfc3501.html
# for more SEARCH options.

Q = "Alaska" # XXX

# Recycle some routines from Example 6-3 so that you arrive at the
# very same data structure you've been using throughout this chapter

def cleanContent(msg):

    # Decode message from "quoted printable" format

    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.

    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '')\
                                 .replace('\r', '').replace(' ', '')\
                                 .decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() == 'multipart':
            continue
        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)
        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB.

    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

# Consume a query from the user. This example illustrates searching by subject.

(status, data) = conn.search(None, '(SUBJECT "%s")' % (Q, ))
ids = data[0].split()

messages = []
for i in ids:
    try:
        (status, data) = conn.fetch(i, '(RFC822)')
        messages.append(email.message_from_string(data[0][1]))
    except Exception, e:
        print e
        print 'Error fetching message %s. Skipping it.' % (i, )

print len(messages)
jsonified_messages = [jsonifyMessage(m) for m in messages]

# Separate out the text content from each message so that it can be analyzed.

content = [p['content'] for m in jsonified_messages for p in m['parts']]

# Content can still be quite messy and contain line breaks and other quirks.

filename = os.path.join('resources/ch06-mailboxes/data',
                        GMAIL_ACCOUNT.split("@")[0] + '.gmail.json')
f = open(filename, 'w')
f.write(json.dumps(jsonified_messages))
f.close()

print >> sys.stderr, "Data written out to", f.name
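Because Example 6-18 needs a live connection, it can help to exercise the parsing step on its own. The sketch below is a Python 3 approximation of jsonifyMessage, not a drop-in replacement: it swaps dateutil for the standard library's email.utils.parsedate_to_datetime, omits the BeautifulSoup HTML stripping, and applies the Date header's UTC offset explicitly (the time.mktime call above interprets the time tuple in the local time zone instead). The names jsonify_raw_message and to_epoch_millis are made up for this illustration.

```python
import calendar
import email
from email.utils import parsedate_to_datetime


def to_epoch_millis(date_header):
    """Convert an RFC 2822 Date header to integer milliseconds since the epoch."""
    dt = parsedate_to_datetime(date_header)
    # utctimetuple() applies the header's UTC offset; timegm() reads the result as UTC
    return calendar.timegm(dt.utctimetuple()) * 1000 + dt.microsecond // 1000


def jsonify_raw_message(raw_bytes):
    """Parse raw RFC 822 bytes into a dict with headers, 'parts', and a '$date'."""
    msg = email.message_from_bytes(raw_bytes)
    json_msg = {"parts": []}
    for key, value in msg.items():
        json_msg[key] = str(value)

    # Address fields may hold several comma-separated recipients.
    # (A naive split; display names that contain commas would be broken up.)
    for key in ("To", "Cc", "Bcc"):
        if json_msg.get(key):
            json_msg[key] = [a.strip() for a in json_msg[key].split(",") if a.strip()]

    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        # decode=True undoes base64 / quoted-printable transfer encodings
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, "ignore")
        except LookupError:  # unknown charset name
            text = payload.decode("utf-8", "ignore")
        json_msg["parts"].append(
            {"contentType": part.get_content_type(), "content": text}
        )

    if "Date" in json_msg:
        json_msg["Date"] = {"$date": to_epoch_millis(json_msg["Date"])}
    return json_msg
```

Feeding it the bytes returned by an IMAP fetch, or any raw message read from an mbox, should yield a dictionary with the same overall shape as the one built in Example 6-18.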

Once you’ve successfully parsed out the text from the body of a Gmail message, some additional work will be required to cleanse the text to the point that it’s suitable for a nice display or advanced NLP, as illustrated in Chapter 5. However, not much effort is required to get it to the point where it’s clean enough for collocation analysis. In fact, the results of Example 6-18 can be fed almost directly into Example 4-12 to produce a list of collocations from the search results. A worthwhile visualization exercise would be to create a graph plotting the strength of linkages between messages based on the number of bigrams they have in common, as determined by a custom metric.
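As a starting point for that exercise, the sketch below computes edge weights for such a graph. It makes two simplifying assumptions that you would likely want to revisit: every adjacent word pair counts as a bigram (no collocation scoring is applied), and the "custom metric" is simply the number of distinct bigrams two messages share. The function names are hypothetical.

```python
import re
from itertools import combinations


def bigrams(text):
    """Return the set of adjacent lowercase word pairs in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return set(zip(words, words[1:]))


def shared_bigram_edges(texts):
    """Return {(i, j): n} for each pair of texts that share n > 0 bigrams."""
    sets = [bigrams(t) for t in texts]
    edges = {}
    for i, j in combinations(range(len(sets)), 2):
        n = len(sets[i] & sets[j])
        if n:
            edges[(i, j)] = n
    return edges


# Hypothetical usage with the `content` list built in Example 6-18:
# edges = shared_bigram_edges(content)
```

Each key is a pair of message indexes and each value is an edge weight, which is the form most graph-drawing libraries expect for a weighted edge list.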

6.5.3. Visualizing Patterns in Gmail with the “Graph Your Inbox” Chrome Extension

There are several useful toolkits floating around that analyze webmail, and one of the most promising to emerge in recent years is the Graph Your Inbox Chrome extension. To use this extension, you just install it, authorize it to access your mail data, run some Gmail queries, and let it take care of the rest. You can search for keywords like “pizza,” search for time values such as “2010,” or run more advanced queries such as “from:matthew@zaffra.com” and “label:Strata”. Figure 6-3 shows a sample screenshot.


Figure 6-3. The Graph Your Inbox Chrome extension provides a concise summary of your Gmail activity

What’s especially remarkable is that you can readily reproduce all of the analytics that this extension provides with the techniques you’ve learned in this chapter plus some supplemental content from earlier chapters, such as the use of a JavaScript visualization library like D3.js or matplotlib’s plotting utilities within IPython Notebook. Your toolbox is full of scripts and techniques that can be readily applied to a data domain to produce a comparable dashboard, whether it be a mailbox, an archive of web pages, or a collection of tweets. You certainly have some careful thinking to do about designing an overall application so that it provides an enjoyable user experience, but the building blocks for the data science and analysis that would be presented to the user are in your grasp.
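One building block such a dashboard might need is a count of matching messages over time. The sketch below assumes messages shaped like the output of Example 6-18, with a Date field of the form {'$date': milliseconds}, and buckets them by month in UTC. Both the function name and the choice of monthly buckets are assumptions for illustration, not a description of how the extension itself works.

```python
from collections import Counter
from datetime import datetime, timezone


def messages_per_month(jsonified_messages):
    """Return sorted (YYYY-MM, count) pairs from each message's '$date' millis."""
    counts = Counter()
    for msg in jsonified_messages:
        date = msg.get("Date")
        if not isinstance(date, dict) or "$date" not in date:
            continue  # skip messages without a parsed date
        when = datetime.fromtimestamp(date["$date"] / 1000.0, tz=timezone.utc)
        counts[when.strftime("%Y-%m")] += 1
    return sorted(counts.items())
```

The resulting list of pairs can be handed to whatever plotting tool you prefer, such as a matplotlib bar chart or a D3.js time series.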
