Apress - Smart Home Automation with Linux (2010) - Part 42

5 136 0

Đang tải... (xem toàn văn)

THÔNG TIN TÀI LIỆU

CHAPTER 6 ■ DATA SOURCES

Once you are able to describe the location of the data in human terms, you can start writing the code! The process involves a mechanized agent that is able to load the web page and traverse links, and a stream processor that skips over the HTML tags. You begin the scraping with a fairly common loading block like this:

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $agent = WWW::Mechanize->new();
    $agent->get("http://www.minervahome.net/news.htm");

    my $stream = HTML::TokeParser->new(\$agent->{content});

Given the $stream, you can now skip to the fourth table, for example, by jumping over four of the opening table tags using the following:

    foreach (1..4) {
        $stream->get_tag("table");
    }

Notice that get_tag positions the stream point immediately after the opening tag given, in this case table. Consequently, the stream point is now inside the fourth table. Since our data is on the first row, you don't need to worry about skipping the tr tag, so you can jump straight into the second column with this:

    $stream->get_tag("td");
    $stream->get_tag("td");

since skipping the td tag will automatically skip the preceding tr. The stream is now positioned exactly where you want it. The HTML structure of this block is as follows:

    <a href="url">Main title</a></td>
    <td valign="top">
    Main story text

So far, I have been using get_tag to skip elements, but it also sports a return value containing the contents of the tag. So, you'd retrieve the information from the anchor with the following, which, by its nature, can return multiple tags:

    my @link = $stream->get_tag("a");

Since you know there is only one in this particular HTML, it is $link[0] that is of interest. Inside this is another array containing the following:

    $link[0][0] # tag
    $link[0][1] # attributes
    $link[0][2] # attribute sequence
    $link[0][3] # text

Therefore, you can extract the link information with the following:

    my $href = $link[0][1]{href};

And since get_tag only retrieves the information about the tag, you must return to the stream to extract all the data between this <a> and the </a>:

    my $storyHeadline = $stream->get_trimmed_text("/a");

From here, you can see that you need to skip the next opening td tag and get the story text between it and the next closing td tag:

    $stream->get_tag("td");
    print $stream->get_trimmed_text("/td");

Since you are only getting the first story from the page, your scraping is done. If you wanted to get the first two stories, for example, you'd need to correctly skip the remainder of this table, or row, before repeating the parse loop. Naturally, if this web page changes in any way, the code won't work!

Fortunately, this game of cat and mouse between the web developers and the screen scrapers often comes to a pleasant end. For us! Tired of redesigning their sites every week, and in an attempt to connect with the Web 2.0 and mash-up communities on the Web, many companies are providing APIs to access their data. And like most good APIs, they remain stable between versions.
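Before moving on, it's worth assembling those fragments into one complete script. This is a minimal sketch of the walkthrough above; the $storyText variable and the final print are my own glue, and, as noted, any change to the page layout will break it.

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $agent = WWW::Mechanize->new();
    $agent->get("http://www.minervahome.net/news.htm");
    my $stream = HTML::TokeParser->new(\$agent->{content});

    # Skip over the opening tags of the first four tables
    $stream->get_tag("table") foreach (1..4);

    # The data is on the first row, so jump straight to the second
    # column (skipping a td implicitly skips the tr before it)
    $stream->get_tag("td");
    $stream->get_tag("td");

    # The headline lives in an anchor; get_tag also returns its attributes
    my @link = $stream->get_tag("a");
    my $href = $link[0][1]{href};
    my $storyHeadline = $stream->get_trimmed_text("/a");

    # The story text fills the next cell
    $stream->get_tag("td");
    my $storyText = $stream->get_trimmed_text("/td");

    print "$storyHeadline ($href)\n$storyText\n";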
Data Through APIs

An API is the way a programmer can interact with the system underneath it. In the web world, an API governs how your scripts can retrieve (and sometimes change) the data on a web server. These break down into several broad groups:

• Basic file access: These files are dispensed via a web server with a filename formatted according to some predetermined rules. To get the UK TV listings for BBC1 in three days' time, for example, you can use the URL http://www.bleb.org/tv/data/listings/3/bbc1.xml. In the truest sense of the word, these are not APIs, but unlike static files, the same request can produce different data according to the time or location from which it's requested. A sketch of this style of request follows this list.

• Public queries: These can exist in many forms, including basic file requests, but are usually based on Simple Object Access Protocol (SOAP) objects or XML over HTTP. This allows function calls, using strongly typed parameters, to be sent to the server, with similarly complex replies returned using XML.

• Private queries: These require the software developer to sign up for a developer API key. These keys, like the ones for Amazon, are embedded into your code so that the server API can authenticate the user and monitor your usage patterns, thereby eliminating most DoS attacks.

There is no consistent legalese to these implementations. Just because a site uses publicly accessible files doesn't necessarily mean that you can redistribute their data. Again, you must check their terms of service (TOS), which are not always obviously displayed. In the case of private queries, the TOS will be displayed beforehand, and you will be required to agree to the terms before a key is assigned to you. These terms will typically limit you to a specific number of accesses per day or within a particular time frame. Usually these limits can be increased with the exchange of currency. If you are looking for APIs to experiment with, a good starting point is http://www.programmableweb.com/apis.
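As a taste of that first group, here is a minimal sketch of a basic file request, assuming the LWP::Simple module is installed. The URL is the Bleb listing quoted in the list above; the day-offset and station-name rules behind it are detailed in the TV Guides section later.

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    # Build the URL from a day offset and a station name, per the
    # pattern quoted above (3 = three days from now)
    my $day     = 3;
    my $station = "bbc1";
    my $url     = "http://www.bleb.org/tv/data/listings/$day/$station.xml";

    # LWP::Simple's get() returns the body, or undef on failure
    my $xml = get($url) or die "Unable to fetch $url\n";
    print $xml;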
Distribution

Unless it is explicitly stated otherwise, any data you generate is considered a derived work of the original copyrighted version. I have merely demonstrated methods by which this data can be obtained (and obtained for personal use only). After all, in most cases, the copyright holders have given their permission for the data to be used on the sites in question but not redistributed beyond that. The letter of the law includes redistribution inside your home, but in most cases (where the home server is private and unavailable to the outside world), it becomes a moot point.

Public Data

In this section, I'll cover data that is available to the public. It is not necessarily in the public domain, however, so you must still adhere to all the rules of legality mentioned previously. Within each section I'll cover some example data that will be useful to your smart home, examine how to access and process it, and talk about ideas for how public data can be incorporated privately at home.

TV Guides

With so many TV stations in so many countries, building a general-purpose data store for all the TV channels (let alone their programs) in the world is a massive undertaking. In the United Kingdom, you have Andrew Flegg to thank for handling all the digital, analog, and primary satellite stations in England, Scotland, Wales, and Northern Ireland. The data presented on this site comes from daily scrapes of the broadcasters' own web sites, along with traditional data feeds, so it is accurate and timely. It is also legal, since permission has been granted to the site for its inclusion.

■ Note Curiously, the data for ITV is not available. This is because ITV doesn't want its data shared on other web sites, although it has no objection to using other TV schedule data on its own site! This might be because of the purely commercial aspect of its business. However, until ITV changes its rules (or the petition takes effect), no geek following these instructions will be able to determine what's on ITV, which in turn will limit ITV's advertising revenues, causing them to have shot themselves in the proverbial foot!

The data itself is available as a web page on the site or as XML files that can be downloaded to your PC and processed at your leisure. The URLs for each XML file follow a strict format so that you can automate the process. The root URL is http://www.bleb.org/tv/data/listings and is followed by this:

• The day offset, between -1 (yesterday) and 6 (next week)

• The station name

Therefore, you can find today's BBC 1 schedule here:

    http://www.bleb.org/tv/data/listings/0/bbc1.xml

And tomorrow's TMF guide is here:

    http://www.bleb.org/tv/data/listings/1/tmf.xml

The format is XMLTV and very easy to parse with a suitable library or even XSLT. With this data in a local, usable format, you can then search the descriptions for films starring your favorite actors, be alerted to new series, or check for musicians appearing in talk shows or programs outside your usual sphere. These results can be piped into any file, such as a web page or e-mail, for review.

Despite its free nature, Bleb does have a couple of restrictions of its own, but the only requirements are that you don't hammer the server (leave a two-second gap between requests) and that you include the name of the program making the requests, along with an e-mail address. Minerva includes an example of this in action and is covered in Chapter 7. There are many other examples, such as executables for Windows, Flash code, and the WhensItOn code found here:

    http://ccgi.useyourhead.force9.co.uk/

This alphabetically sorts the entire week's TV schedule so you can see at what time a show is on and when it's repeated. A download-and-search sketch follows.
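Here is one hedged sketch of such a search, looking for a favorite actor in today's descriptions. It honors the two-second gap and identifies itself with a program name and e-mail address (both placeholders here). It assumes XML::LibXML is installed and that the files follow the standard XMLTV layout of <programme> elements with <title> and <desc> children; verify against a downloaded file before trusting it.

    #!/usr/bin/perl -w
    use strict;
    use LWP::UserAgent;
    use XML::LibXML;

    # Identify the program and an e-mail address, per Bleb's requirements
    # (this agent string is a placeholder; substitute your own details)
    my $ua = LWP::UserAgent->new(agent => "tvwatch/0.1 (me\@example.com)");

    foreach my $station ("bbc1", "tmf") {
        my $url = "http://www.bleb.org/tv/data/listings/0/$station.xml";
        my $response = $ua->get($url);
        next unless $response->is_success;

        # Standard XMLTV: each programme element carries title/desc children
        my $doc = XML::LibXML->load_xml(string => $response->decoded_content);
        foreach my $prog ($doc->findnodes("//programme")) {
            my $desc = $prog->findvalue("desc");
            print $prog->findvalue("title"), " ($station)\n"
                if $desc =~ /Your Favorite Actor/;    # placeholder name
        }

        sleep 2;    # the two-second gap Bleb asks for between requests
    }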
Train Times

Like the TV schedules, obtaining complete timetables for every train around the world is a thankless and impossible task. But fortunately, like TV, the rail journeys of interest are generally based in one country, so you need to find only one suitable data source for your area. Any search engine will return several different data sources for this information, depending on your country, so you need to spend a little time looking at each one to determine which have APIs (that are usable) or, failing that, which can be scraped with the least amount of effort.

Most people who use trains regularly have a routine, where they know the timetables, so the web sites of most interest are those that report the live information, including late trains and cancellations. In England, the foremost site is Live Departure Boards (www.nationalrail.co.uk/times_fares/ldb/), which provides reasonably accurate information about most trains on the U.K. network. It doesn't include an API, unfortunately, but it is very easy to scrape for the current train times and also comes with a Twitter feed detailing station closures and overrunning engineering works. It also has the advantage of using a basic GET request to retrieve the times of all the trains between two named stations, making it easier to bookmark.

One journey I make on occasion is between St. Pancras and Luton Airport. On reviewing the site, I can see that the URL incorporates both of these locations (?T=STP&S=LTN) and can be incorporated into the code for whattrain.pl like this:

    my $agent = WWW::Mechanize->new();
    $agent->get("http://realtime.nationalrail.co.uk/ldb/sumdep.aspx?T=STP&S=LTN");

Looking at the HTML source, there is a suspiciously unique row2 class identifier for each of the rows detailing the trains. So, you need a simple loop to check for this:

    while ($tag = $stream->get_tag("tr")) {
        if (defined $tag->[1]{class} && $tag->[1]{class} eq "row2") {

which in turn allows you to read the information for each train:

    $stream->get_tag("td");
    my $destination = $stream->get_trimmed_text("/td");
    $stream->get_tag("td");
    my $platform = $stream->get_trimmed_text("/td");
    $stream->get_tag("td");
    my $expectedTime = $stream->get_trimmed_text("/td");
    $stream->get_tag("td");
    my $arrivalTime = $stream->get_trimmed_text("/td");

This web site has been used, in part, because enough information is present to make automatic value judgments when it comes to catching the train. For example, knowing that it takes 35 minutes to travel from work to St. Pancras, you can exclude any train leaving within that window. Furthermore, by adding a grace period, you can limit the output to trains that will leave within ten minutes of you arriving at the station:

    my $graceMinutes = $minutesAway - $timeToStation;
    if ($graceMinutes >= $graceThreshold && $graceMinutes < $maxGracePeriod) {
        print "Get the $expectedTime to $destination from platform $platform.";
        print "There is $graceMinutes minutes grace.\n";
    }

This code can naturally be extended to swap the source and destination locations so that the return journey can be similarly considered. This could happen by looking at the time of day, for example.

■ Note This is screen-scraping Perl code, which may have broken by the time you read this. The relative pitfalls and considerations of this approach were covered earlier in the chapter.
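For completeness, here is how those fragments might fit together into a whattrain.pl-style sketch. The stream setup mirrors the news scraper earlier; the minutes arithmetic is a naive placeholder of my own (it assumes the expected-time column holds an HH:MM value and ignores trains after midnight), and the row2 class may well have changed since publication.

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $timeToStation  = 35;   # minutes from work to St. Pancras
    my $graceThreshold = 0;    # don't suggest trains you can't make
    my $maxGracePeriod = 10;   # or ones with too long a wait

    my $agent = WWW::Mechanize->new();
    $agent->get("http://realtime.nationalrail.co.uk/ldb/sumdep.aspx?T=STP&S=LTN");
    my $stream = HTML::TokeParser->new(\$agent->{content});

    while (my $tag = $stream->get_tag("tr")) {
        # Only the rows describing trains carry the row2 class
        next unless defined $tag->[1]{class} && $tag->[1]{class} eq "row2";

        $stream->get_tag("td");
        my $destination  = $stream->get_trimmed_text("/td");
        $stream->get_tag("td");
        my $platform     = $stream->get_trimmed_text("/td");
        $stream->get_tag("td");
        my $expectedTime = $stream->get_trimmed_text("/td");
        $stream->get_tag("td");
        my $arrivalTime  = $stream->get_trimmed_text("/td");

        # Naive HH:MM to minutes-from-now conversion (my assumption,
        # not from the book; ignores the midnight wraparound)
        my ($h, $m) = $expectedTime =~ /(\d+):(\d+)/ or next;
        my @now = localtime;
        my $minutesAway = ($h * 60 + $m) - ($now[2] * 60 + $now[1]);

        my $graceMinutes = $minutesAway - $timeToStation;
        if ($graceMinutes >= $graceThreshold && $graceMinutes < $maxGracePeriod) {
            print "Get the $expectedTime to $destination from platform $platform. ";
            print "There is $graceMinutes minutes grace.\n";
        }
    }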
