Part I. A Guided Tour of the Social Web Prelude
8. Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
8.2. Microformats: Easy-to-Implement Metadata
8.2.1. Geocoordinates: A Common Thread for Just About Anything
The implications of using microformats are subtle yet somewhat profound: while a human might be reading an article about a place like Franklin, Tennessee, and intuitively know that a dot on a map on the page denotes the town’s location, a robot could not reach the same conclusion easily without specialized logic that targets various pattern-matching possibilities. Such page scraping is a messy proposition, and typically just when you think you have all of the possibilities figured out, you find that you’ve missed one. Embedding proper semantics into the page that effectively tag unstructured data in a way that even Robby the Robot could understand removes ambiguity and lowers the bar for crawlers and developers. It’s a win-win situation for the producer and the consumer, and hopefully the net effect is increased innovation for everyone.
Although it’s certainly true that standalone geodata isn’t particularly social, important but nonobvious relationships often emerge from disparate data sets that are tied together with a common geographic context.
Geodata is ubiquitous. It plays a powerful part in too many social mashups to name, because a particular point in space can be used as the glue for clustering people together.
The divide between “real life” and life on the Web continues to close, and just about any kind of data becomes social the moment that it is tied to a particular individual in the real world. For example, there’s an awful lot that you might be able to tell about people based on where they live and what kinds of food they like. This section works through some examples of finding, parsing, and visualizing geo-microformatted data.
One of the simplest and most widely used microformats that embeds geolocation information into web pages is appropriately called geo. The specification is inspired by a property with the same name from vCard, which provides a means of specifying a location. There are two possible means of embedding a microformat with geo. The following HTML snippet illustrates the two techniques for describing Franklin, Tennessee:
<!-- The multiple class approach -->
<span style="display: none" class="geo">
    <span class="latitude">36.166</span>
    <span class="longitude">-86.784</span>
</span>

<!-- When used as one class, the separator must be a semicolon -->
<span style="display: none" class="geo">36.166; -86.784</span>
As you can see, this microformat simply wraps latitude and longitude values in tags with corresponding class names, and packages them both inside a tag with a class of geo. A number of popular sites, including Wikipedia and OpenStreetMap, use geo and other microformats to expose structured data in their pages.
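To see how both embedding styles reduce to the same pair of numbers, here is a minimal sketch that parses the snippet above using only Python's standard library. The names GeoParser and extract_geo are hypothetical helpers invented for this illustration, and the parser assumes small, well-formed markup like the snippet shown; it is not a general-purpose microformat parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}
ROLES = ('geo', 'latitude', 'longitude')


class GeoParser(HTMLParser):
    """Hypothetical helper: collects (lat, lon) pairs from geo markup."""

    def __init__(self):
        super().__init__()
        self.coords = []   # finished (latitude, longitude) float pairs
        self._stack = []   # role of each open element, or None
        self._text = {}    # text gathered for the geo element being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get('class') or '').split()
        role = next((r for r in ROLES if r in classes), None)
        if role == 'geo':
            self._text = {r: '' for r in ROLES}
        self._stack.append(role)

    def handle_data(self, data):
        # Attribute text to the innermost open element, if it has a role
        if self._stack and self._stack[-1] and self._text:
            self._text[self._stack[-1]] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        if self._stack.pop() != 'geo':
            return
        lat = self._text['latitude'].strip()
        lon = self._text['longitude'].strip()
        if not (lat and lon) and ';' in self._text['geo']:
            # Single-class form: "latitude; longitude"
            lat, lon = [s.strip() for s in self._text['geo'].split(';', 1)]
        if lat and lon:
            self.coords.append((float(lat), float(lon)))
        self._text = {}


def extract_geo(html):
    parser = GeoParser()
    parser.feed(html)
    return parser.coords


SNIPPET = '''
<!-- The multiple class approach -->
<span style="display: none" class="geo">
    <span class="latitude">36.166</span>
    <span class="longitude">-86.784</span>
</span>
<!-- When used as one class, the separator must be a semicolon -->
<span style="display: none" class="geo">36.166; -86.784</span>
'''

print(extract_geo(SNIPPET))
```

Both forms yield the same coordinate pair, which is the point of the microformat: a consumer needs no page-specific scraping logic, only knowledge of the class names.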
A common practice with geo is to hide the information that’s encoded from the user. There are two ways that you might do this with traditional CSS: style="display: none" and style="visibility: hidden". The former removes the element’s placement on the page entirely so that the layout behaves as though it is not there at all. The latter hides the content but reserves the space it takes up on the page.
Example 8-1 illustrates a simple program that parses geo-microformatted data from a Wikipedia page to show how you could extract coordinates from content implementing the geo microformat. Note that Wikipedia’s terms of use define a bot policy that you should review prior to attempting to retrieve any content with scripts such as the following. The gist is that you’ll need to download data archives that Wikipedia periodically updates as opposed to writing bots to pull nontrivial volumes of data from the live site. (It’s fine for us to yank a web page here for educational purposes.)
As should always be the case, carefully review a website’s terms of service to ensure that any scripts you run against it comply with its latest guidelines.
Example 8-1. Extracting geo-microformatted data from a Wikipedia page
import requests  # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# XXX: Any URL containing a geo microformat...
URL = 'http://en.wikipedia.org/wiki/Franklin,_Tennessee'

# In the case of extracting content from Wikipedia, be sure to
# review its "Bot Policy," which is defined at
# http://meta.wikimedia.org/wiki/Bot_policy#Unacceptable_usage
req = requests.get(URL, headers={'User-Agent': "Mining the Social Web"})
soup = BeautifulSoup(req.text, 'html.parser')

# Find the first element, of any tag name, with a class of "geo"
geoTag = soup.find(True, 'geo')

if geoTag and len(geoTag) > 1:
    # Multiple class approach: child elements hold each value
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Single class approach: "latitude; longitude" in one text node
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('No location found')
The following sample results illustrate that the output is just a set of coordinates, as expected:
Location is at 35.92917 -86.85750
To make the output a little bit more interesting, however, you could display the results directly in IPython Notebook with an inline frame, as shown in Example 8-2.
Example 8-2. Displaying geo-microformats with Google Maps in IPython Notebook
from IPython.display import IFrame, display

# Google Maps URL template for an iframe. The parentheses ensure that
# format() applies to the whole template, not just its second half.
google_maps_url = ("http://maps.google.com/maps?q={0}+{1}&"
                   "ie=UTF8&t=h&z=14&{0},{1}&output=embed").format(lat, lon)

display(IFrame(google_maps_url, '425px', '350px'))
Sample results after executing this call in IPython Notebook are shown in Figure 8-1.
Figure 8-1. IPython Notebook’s ability to display inline frames can add a lot of interactivity and convenience to your experiments in data analysis
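If you would rather not assemble the query string in Example 8-2 by hand, the standard library can do the escaping for you. The following is a minimal sketch, not the book's own method: google_maps_embed_url is a hypothetical helper name, the parameters are the ones already used in Example 8-2 (q, ie, t, z, and output), and the bare coordinate pair from that template is deliberately left out because it has no parameter name.

```python
from urllib.parse import urlencode


def google_maps_embed_url(lat, lon, zoom=14):
    """Hypothetical helper: build the iframe URL from Example 8-2."""
    params = [
        ('q', '{0} {1}'.format(lat, lon)),  # the space is encoded as "+"
        ('ie', 'UTF8'),
        ('t', 'h'),
        ('z', zoom),
        ('output', 'embed'),
    ]
    return 'http://maps.google.com/maps?' + urlencode(params)


# Coordinates taken from the sample output of Example 8-1
print(google_maps_embed_url('35.92917', '-86.85750'))
```

The resulting string can be passed to IFrame exactly as in Example 8-2. Letting urlencode handle the escaping matters more once a value can contain characters such as spaces or ampersands.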
The moment you find a web page with compelling geodata embedded, the first thing you’ll want to do is visualize it. For example, consider the “List of National Parks of the United States” Wikipedia article. It displays a nice tabular view of the national parks and marks them up with geoformatting, but wouldn’t it be nice to quickly load the data into an interactive tool for visual inspection? A terrific little web service called microform.at extracts several types of microformats from a given URL and passes them back in a variety of useful formats. It exposes multiple options for detecting and interacting with microformat data in web pages, as shown in Figure 8-2.
Figure 8-2. microform.at’s results for the Wikipedia article entitled “List of National Parks of the United States”
If you’re given the option, KML (Keyhole Markup Language) output is one of the more ubiquitous ways to visualize geodata. You can either download Google Earth and load the KML file locally, or type a URL containing KML data directly into the Google Maps search bar to bring it up without any additional effort required. In the results displayed for microform.at, clicking on the “KML” link triggers a file download that you can use in Google Earth. Alternatively, you can right-click the link, copy its URL to the clipboard, and pass that URL to Google Maps.
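If you have never looked inside a KML file, it helps to see how little is involved. The following is a minimal sketch that writes one Placemark per location using the standard library; to_kml is a hypothetical helper name, and the output is a bare-bones document rather than a reproduction of what microform.at emits. Note that KML lists coordinates in longitude,latitude order, the reverse of the geo microformat.

```python
import xml.etree.ElementTree as ET

KML_NAMESPACE = 'http://www.opengis.net/kml/2.2'


def to_kml(places):
    """Hypothetical helper: places is a list of (name, lat, lon) tuples."""
    kml = ET.Element('kml', xmlns=KML_NAMESPACE)
    document = ET.SubElement(kml, 'Document')
    for name, lat, lon in places:
        placemark = ET.SubElement(document, 'Placemark')
        ET.SubElement(placemark, 'name').text = name
        point = ET.SubElement(placemark, 'Point')
        # KML expects longitude first, then latitude
        ET.SubElement(point, 'coordinates').text = '{0},{1}'.format(lon, lat)
    return ET.tostring(kml, encoding='unicode')


# Coordinates taken from the sample output of Example 8-1
kml_text = to_kml([('Franklin, Tennessee', 35.92917, -86.85750)])
print(kml_text)
```

Saving the returned string to a file with a .kml extension gives you something you can try loading into Google Earth.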
Figure 8-3 displays the Google Maps visualization for http://microform.at/?type=geo&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FList_of_U.S._national_parks, the KML results for the aforementioned Wikipedia article. That URL is just the base URL http://microform.at with type and url query string parameters.
Figure 8-3. Google Maps results that display all of the national parks in the United States when passed KML results from microform.at
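The percent-encoded URL passed to microform.at is easier to produce programmatically than by hand. Here is a minimal sketch, assuming only the type and url parameters described above; microformat_url is a hypothetical helper name, and the code builds the string without making any request to the service.

```python
from urllib.parse import urlencode


def microformat_url(page_url, microformat='geo'):
    """Hypothetical helper: compose a microform.at query for a page."""
    query = urlencode({'type': microformat, 'url': page_url})
    return 'http://microform.at/?' + query


print(microformat_url('http://en.wikipedia.org/wiki/List_of_U.S._national_parks'))
```

The colon and slashes in the Wikipedia address come out as %3A and %2F, so the nested URL cannot be confused with the query string of the outer one.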
The ability to start with a Wikipedia article containing semantic markup such as geodata and trivially visualize it is a powerful analytical capability because it delivers insight quickly for so little effort. Browser extensions such as the Firefox Operator add-on aim to minimize the effort even further. Only so much can be said in one chapter, but a neat way to spend an hour or so would be to mash up the national park data from this section with contact information from your LinkedIn professional network to discover how you might be able to have a little bit more fun on your next (possibly contrived) business trip. (See Section 3.3.4.4 on page 127 for an example of how to harvest and analyze geodata by applying the k-means technique for finding clusters and computing centroids for those clusters.)
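As a starting point for that kind of mashup, here is a minimal sketch that finds the park nearest to a contact by great-circle distance. It uses the haversine formula rather than the k-means technique referenced above, and every input is an assumption made for illustration: the contact is simply the Franklin, Tennessee, location from Example 8-1, the two park coordinates are approximate, rounded values, and haversine_km and nearest_park are hypothetical helper names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius of the Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_park(lat, lon, parks):
    """Return the (name, lat, lon) tuple closest to the given point."""
    return min(parks, key=lambda p: haversine_km(lat, lon, p[1], p[2]))


# Illustrative inputs only: approximate, rounded park coordinates
parks = [
    ('Great Smoky Mountains', 35.68, -83.53),
    ('Mammoth Cave', 37.18, -86.10),
]
contact_lat, contact_lon = 35.92917, -86.85750  # from Example 8-1

name, park_lat, park_lon = nearest_park(contact_lat, contact_lon, parks)
distance = haversine_km(contact_lat, contact_lon, park_lat, park_lon)
print('Nearest park: {0} (about {1:.0f} km away)'.format(name, distance))
```

In a real mashup, the parks list would come from the geo data extracted earlier in this section, and the contact locations would come from geocoded profile data.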