Post: Array
(
    [sender] => 012345678
    [content] => Null Wow this might work, you know!
    [inNumber] => 447786202240
    [submit] => Submit
    [network] => UNKNOWN
    [email] => none
    [keyword] => NULL
    [comments] => Wow this might work, you know!
)

Both contain enough information to let you switch your lights on with a text message. The code is trivial, as follows:

if ($_POST['from'] == "012345678") {
    if ($_POST['text'] == "bedroom on") {
        system("/usr/local/bin/heyu turn bedroom_light on");
    } else if ($_POST['text'] == "bedroom off") {
        system("/usr/local/bin/heyu turn bedroom_light off");
    }
}

To eliminate the sending of fiddly text messages (and perhaps save money), you can test future permutations of this script with a simple web page. Using the simpler format, you can write code such as the following:

<form action="echo.php" method="POST">
    <input name="from" value="phone num">
    <input name="text" value="your message here">
    <input name="msgid" value="" type="hidden">
    <input name="type" value="1" type="hidden">
    <input type="submit" value="Send Fake SMS">
</form>

Being on an open web server, there are some security issues. You eliminate one by having the phone number verified by a piece of code on the server (never validate credentials on the client). You can limit another (although not eliminate it) by renaming your simply named echo.php script to something like iuytvaevew.php, employing security through obscurity so that it is not found by accident. Some providers will call your web page using HTTPS, which is the best solution and worth the extra time it takes to set up a specific username and password for them.

You can rebalance the concepts of security and accessibility by allowing multiple phones to access the house, creating a white list of mobile phone numbers and adding to this list explicitly. Or you could ban any access to your page from an IP address that isn't similarly approved and known to belong to your gateway provider. If you were likely to be communicating a lot through SMS, you could automatically add new phone numbers to a pending list of preapproved devices, which in turn sends a notification message to the SMS administrator, who can then issue a special command to add them onto the list.

If your facilities allow, having a physical mobile phone connected through Gnokii may be useful in emergencies when you have no Internet connectivity and you want to be informed that the automatic power cycling of the router (with an AW12 perhaps, as mentioned in Chapter 1) is in progress.

Conclusion

With so many ways of communicating into and out of a system, you must begin with a solid framework. My method is to separate the input systems from the processing, allowing any input mechanism (mobile phone, e-mail, or web interface) to generate a command in a known common format that can be processed by a single script. In a similar way, all outgoing messages are sent to a single script that formats each one appropriately for the given communication channel. You can also add an automatic process upon receipt of any, or all, of these messages. So, once you have code to control a video, light switch, or alarm clock, you can process them in any order to e-mail your video, SMS your light switch, speak to your alarm clock, or do any combination thereof.
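To make the idea of a common command format concrete, here is a minimal sketch. It is not Minerva's actual code; the command format, device names, and handlers are illustrative assumptions. The point is simply that every input channel reduces its request to the same short string, and one dispatcher acts on it.

#!/usr/bin/perl -w
# Sketch of a single dispatcher for the "common command" idea described above.
# The format ("device argument", e.g. "light on") and the handler names are
# assumptions for illustration, not the code used elsewhere in this book.
use strict;

my %handlers = (
    light => sub { system("/usr/local/bin/heyu", "turn", "bedroom_light", $_[0]); },
    say   => sub { print "Speaking: $_[0]\n"; },   # stand-in for a TTS call
);

# The command arrives the same way whether it came from SMS, e-mail, or a web form.
my $command = defined $ARGV[0] ? $ARGV[0] : "";
my ($device, $argument) = split / /, $command, 2;
$argument = "" unless defined $argument;

if ($device && exists $handlers{$device}) {
    $handlers{$device}->($argument);
} else {
    print "Unknown command: $command\n";
}

An SMS gateway script, a mail filter, or a web form can each invoke something like this with the same argument, so a new input channel never needs to know how the devices themselves are driven.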
CHAPTER 6

Data Sources

Making Homes Smart

Although being able to e-mail your light switch is very interesting, and infinitely cooler than programming yet another version of "Hello, World," it never feels like an automatic house. After all, you as a human are controlling it. By providing your house with information about the real world, you enable it to make decisions for itself. This is the distinction between an automated home and a smart home.

Why Data Is Important

For years, the mantra "Content is king" has been repeated in every field of technology. Although most of the data in your home automation environment so far has been generated from your own private living patterns, there is still a small (but significant) amount of data that you haven't generated, such as TV schedules. I'll now cover this data to see what's available and how you can (legally) make use of it.

Legalities

All data is copyrighted. Whether it is a table of rainfall over the past 20 years or the listing for tonight's TV, any information that has been compiled by a human is afforded a copyright. The exception is where data has been generated by a computer program, in which case the source data is copyrighted by the individual who created it, and the copyright to the compiled version is held by the person who enabled the computer to generate it, usually the person who paid for the machine. Unfortunately, all useful data falls into the first category. Even when the data is made publicly available, such as on a web site, or when it appears to be self-evident (such as the top ten music singles), the data still has a copyright attached to it, which requires you to have permission to use it.1

Depending on jurisdiction, copyright will traditionally lapse 50 or 75 years after the death of the last surviving author. However, with the introduction of new laws, such as the Sonny Bono Copyright Term Extension Act, even these lengthy periods may be extended. In this field, the data becomes useless before it becomes available, which is unfortunate.

1 IANAL: I am not a lawyer, and all standard disclaimers apply here!

Fortunately, there are provisions for private use and study in most countries that allow you to process this data for your own personal use. Unfortunately, this does not include redistributing the data to others or manipulating the data into another format. This, from a purely technical and legal point of view, means that you can't do the following:

• Provide the data to others in your household. They have to download it themselves. This includes reproducing the information on a home page or distributing a TV or radio signal to other machines.

• Improve the format of the data and provide it to others who are technically unable to do the same. This includes parsing the data from one web site to show it in a more compact format at home.

There is even a questionable legality in some areas over whether you are allowed to provide tools that improve or change the format of existing (copyrighted) data. Fortunately, most companies turn a blind eye to this area, as they do for the internal distribution of data to members of your household (not that they'd know, or be able to prove it, if you did). The larger issue has to do with improvements to the data, since most data is either too raw or too complex to be useful.
Let's take a web site containing the weather forecast as an example; the raw data might include only the string "rain, 25," which would need to be parsed into a nice icon and a temperature bar to be user-friendly. A complex report could include a friendly set of graphics on the original site but make the original data set unavailable to anyone else who either tries to load the report from another site through deep linking or tries to reference the source table data used to build the image.

Screen Scraping

This is the process whereby a web page is downloaded by a command-line tool, such as wget or cURL, and then processed by an HTML parser so that individual elements can be read and extracted from it. This is the most legally suspect and most troublesome method of processing information. It is the most suspect because you are downloading copyrighted content from a site in a manner that is against the site's terms and conditions; so much so that, until fairly recently, one famous weather site labeled its images as please_dont_scrape_this_use_the_api.gif!

Scraping is troublesome because it is very difficult to accurately parse a web page for content. It is very easy to parse the page on a technical level, because the language is computer-based and parsers already exist. It is also very easy for a user to parse the rendered page for the data, because the human eye will naturally seek out the information it desires. But knowing that the information is in the top-left corner of the screen is a very difficult thing for a machine to assess.

Instead, most scrapers will work on a principle of blocking. This is where the information is known to exist in a particular block, determined beforehand by a programmer, and the parser blindly copies data from that block. For example, it will go to the web page, find the third table, look in the fifth column and second row, and read the data from the first paragraph tag. This is time-consuming to determine but easy to parse. It is troublesome because any breakages in the HTML format itself (either introduced intentionally by the developers or introduced accidentally because of changes in advertising2) will require the script to be modified or rewritten.

Because of the number of different languages and libraries available to the would-be screen-scraper, and the infinite number of (as yet undetermined) formats into which you'd like to convert the data, there isn't really a database of known web sites with matching scraping code. To build one would be a massive undertaking. However, if you're unable to program suitable scraping code, it might be best to seek out local groups or communities based around the web site in question, such as TV fan pages. Any home will generally have a large number of data sources, and trying to maintain scrapers for each source will be time-consuming if you attempt it alone.

The mechanics of scraping are best explained with an example. In this case, I'll use Perl and the WWW::Mechanize and HTML::TokeParser modules. Begin by installing them in any way suitable for your distribution. I personally use the CPAN module, which generally autoconfigures itself upon invocation of the cpan command.
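As a quick aside (assuming a reasonably standard Perl installation with the CPAN module available), the o conf configuration commands shown next are typed at the cpan> prompt of CPAN's interactive shell, which you can reach with either of these:

perl -MCPAN -e shell
# or, on most systems, simply:
cpan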
Additional mirrors can be added to the URL list like this:

o conf urllist push ftp://ftp-mirror.internap.com/pub/CPAN/
o conf commit

This is then followed by the installation of the modules themselves:

perl -MCPAN -e 'install WWW::Mechanize'
perl -MCPAN -e 'install HTML::TokeParser'

Lest I advocate scraping a page of a litigious company, I will provide an example using my own Minerva site to retrieve the most recent story from the news page at http://www.minervahome.net/news.htm.

Begin by loading the page in a web browser to get a feel for the page layout and to see where the target information is located. Also, review other pages to see whether there's any commonality that can be exploited. You can do this by reviewing the source (either as a whole page or with a "view source selection" option) or by enlisting the help of Firebug3 to highlight the tables and subcomponents within the table. Then look for any "low-hanging fruit." These are the easily solved parts of a problem, so you might find the desired text inside a specially named div element or inside a table with a particular id attribute. Many professionally designed web sites do this to make redesigns quicker and so unwittingly help the scraper.

If there are no distinguishing features around the text, look to the elements surrounding it, and then to the elements surrounding those. Work outward until you find something unique enough to be of interest or you reach the root html node. If you've found nothing unique, then you will have to describe the data with code like "in the first row and second column of the third table."

2 And although the Web exists as a free resource for information, someone will be paying for advertising space to offset the production costs.

3 Firebug is an extension to Firefox that allows web developers (and curious geeks) full access to the inner workings of the web pages that appear in the browser.

Once you are able to describe the location of the data in human terms, you can start writing the code! The process involves a mechanized agent that is able to load the web page and traverse links, and a stream processor that skips over the HTML tags. You begin the scraping with a fairly common loading block like this:

#!/usr/bin/perl -w
use strict;

use WWW::Mechanize;
use HTML::TokeParser;

my $agent = WWW::Mechanize->new();
$agent->get("http://www.minervahome.net/news.htm");

my $stream = HTML::TokeParser->new(\$agent->{content});

Given the $stream, you can now skip to the fourth table, for example, by jumping over four of the opening table tags using the following:

foreach (1..4) {
    $stream->get_tag("table");
}

Notice that get_tag positions the stream point immediately after the opening tag given, in this case table. Consequently, the stream point is now inside the fourth table. Since our data is on the first row, you don't need to worry about skipping the tr tag, so you can jump straight into the second column with this:

$stream->get_tag("td");
$stream->get_tag("td");

since seeking to the td tag will automatically skip past the preceding tr. The stream is now positioned exactly where you want it. The HTML structure of this block is as follows:

<a href="url">Main title</a></td>
<td valign="top">
Main story text

So far, I have been using get_tag to skip elements, but it also sports a return value containing the contents of the tag.
So, you'd retrieve the information from the anchor with the following, which, by its nature, can return multiple tags:

my @link = $stream->get_tag("a");

Since you know there is only one in this particular HTML, it is $link[0] that is of interest. Inside this is another array containing the following:

$link[0][0]   # tag
$link[0][1]   # attributes
$link[0][2]   # attribute sequence
$link[0][3]   # text

Therefore, you can extract the link information with the following:

my $href = $link[0][1]{href};

And since get_tag only retrieves the information about the tag, you must return to the stream to extract all the data between this <a> and the </a>:

my $storyHeadline = $stream->get_trimmed_text("/a");

From here, you can see that you need to skip the next opening td tag and get the story text between it and the next closing td tag:

$stream->get_tag("td");
print $stream->get_trimmed_text("/td");

Since you are only getting the first story from the page, your scraping is done. If you wanted to get the first two stories, for example, you'd need to correctly skip the remainder of this table, or row, before repeating the parse loop again. Naturally, if this web page changes in any way, the code won't work!

Fortunately, this game of cat and mouse between the web developers and the screen scrapers often comes to a pleasant end. For us! Tired of redesigning their sites every week, and in an attempt to connect with the Web 2.0 and mash-up communities on the Web, many companies are providing APIs to access their data. And like most good APIs, they remain stable between versions.

Data Through APIs

Traditionally, an API is the way a programmer interacts with the operating system underneath it. In the web world, an API governs how your scripts can retrieve (and sometimes change) the data on a web server. These break down into several broad groups:

• Basic file access: These files are dispensed via a web server with a filename formatted according to some predetermined rules. To get the UK TV listings for BBC1 in three days' time, for example, you can use the URL http://www.bleb.org/tv/data/listings/3/bbc1.xml. In the truest sense of the word, these are not APIs, but unlike static files, the same request can produce different data according to the time or location from which it is requested.

• Public queries: These can exist in many forms, including basic file requests, but are usually based on Simple Object Access Protocol (SOAP) objects or XML over HTTP. This allows function calls, using strongly typed parameters, to be sent to the server, with similarly complex replies returned using XML.

• Private queries: These require the software developer to sign up for a developer API key. These keys, like the ones for Amazon, are embedded into your code so that the server API can authenticate the user and monitor your usage patterns, thereby eliminating most DoS attacks.

There is no consistent legalese to these implementations. Just because a site uses publicly accessible files doesn't necessarily mean that you can redistribute its data. Again, you must check its terms of service (TOS), which are not always obviously displayed.

In the case of private queries, the TOS will be displayed beforehand, and you will be required to agree to the terms before a key is assigned to you. These terms will typically limit you to a specific number of accesses per day or within a particular time frame. Usually these limits can be increased with the exchange of currency.
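As a rough illustration of what a key-based private query usually boils down to, consider the following sketch. The host, path, parameter names, and key value are all hypothetical and stand in for whatever your chosen provider documents; the shape of the request, a GET with the key passed as a parameter and an XML reply to parse, is the common pattern.

#!/usr/bin/perl -w
# Hypothetical key-based API request. The endpoint and parameters below are
# placeholders for illustration only; substitute the ones from your provider's
# documentation, and respect whatever rate limits its TOS imposes.
use strict;
use WWW::Mechanize;

my $apiKey = "0123456789abcdef";    # assigned to you when you agree to the TOS
my $url    = "http://api.example.com/v1/lookup?item=weather&format=xml&key=$apiKey";

my $agent = WWW::Mechanize->new();
$agent->get($url);

if ($agent->success()) {
    print $agent->content();        # the XML reply, ready for an XML parser or XSLT
} else {
    print "Request failed: " . $agent->status() . "\n";
}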
If you are looking for APIs to experiment with, then a good starting point is http://www.programmableweb.com/apis.

Distribution

Unless it is explicitly stated otherwise, any data you generate is considered a derived work of the original copyrighted version. I have merely demonstrated methods by which this data can be obtained (and obtained for personal use only). After all, in most cases, the copyright holders have given their permission for the data to be used on the sites in question but not redistributed beyond that. The letter of the law includes redistribution inside your home, but in most cases (where the home server is private and unavailable to the outside world), it becomes a moot point.

Public Data

In this section, I'll cover data that is available to the public. It is not necessarily available in the public domain, however, so you must still adhere to all the rules of legality mentioned previously. Within each section I'll cover some example data that will be useful to your smart home, examine how to access and process it, and discuss ideas for how public data can be incorporated privately at home.

TV Guides

With so many TV stations in so many countries, building a general-purpose data store for all the TV channels (let alone their programs) in the world is a massive undertaking. In the United Kingdom, you have Andrew Flegg to thank for his site, which handles all the digital, analog, and primary satellite stations in England, Scotland, Wales, and Northern Ireland. The data presented on this site comes from daily scrapes of the broadcasters' own web sites, along with traditional data feeds, so it is accurate and timely. It is also legal, since permission has been granted to the site for its inclusion.

■ Note Curiously, the data for ITV is not available. This is because ITV doesn't want its data shared on other web sites, although it has no objection to using other TV schedule data on its own site! This might be because of the purely commercial aspect of its business. However, until ITV changes its rules (or the petition takes effect), no geek following these instructions will be able to determine what's on ITV, which in turn will limit ITV's advertising revenues, causing it to have shot itself in the proverbial foot!

The data itself is available as a web page on the site or as XML files that can be downloaded to your PC and processed at your leisure. The URLs for each XML file follow a strict format so that you can automate the process.

The root URL is http://www.bleb.org/tv/data/listings and is followed by this:

• The day offset, between -1 (yesterday) and 6 (next week)

• The station name

Therefore, you can find today's BBC 1 schedule here:

http://www.bleb.org/tv/data/listings/0/bbc1.xml

And tomorrow's TMF guide is here:

http://www.bleb.org/tv/data/listings/1/tmf.xml

The format is XMLTV and very easy to parse with a suitable library or even XSLT. With this data in a local, usable format, you can then search the descriptions for films starring your favorite actors, be alerted to new series, or check for musicians appearing in talk shows or programs outside your usual sphere. These results can be piped into any file, such as a web page or e-mail, for review.

Despite its free nature, Bleb does have a couple of restrictions of its own, but the only requirements are that you don't repeatedly ask for data from the server (leave a two-second gap between requests) and that you include the name of the program making these requests, along with an e-mail address.
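A minimal sketch of a polite download might look like the following. The identifying string and the channel list are placeholders; substitute your own program name, e-mail address, and channels of interest.

#!/usr/bin/perl -w
# Fetch today's XMLTV listings from bleb.org for a few example channels.
# The agent string below is a placeholder: replace it with your own program
# name and e-mail address, as Bleb requests.
use strict;
use WWW::Mechanize;

my $agent = WWW::Mechanize->new(agent => "myhometv/0.1 (me\@example.com)");

foreach my $channel ("bbc1", "bbc2", "tmf") {
    $agent->get("http://www.bleb.org/tv/data/listings/0/$channel.xml");
    if ($agent->success()) {
        open(my $fh, ">", "$channel.xml") or die "Can't write $channel.xml: $!";
        print $fh $agent->content();
        close($fh);
    }
    sleep 2;    # the two-second gap between requests that Bleb asks for
}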
Minerva includes an example of this in action and is covered in Chapter 7. There are many other examples, such as executables for Windows, Flash code, and the WhensItOn code found here:

http://ccgi.useyourhead.force9.co.uk/

This alphabetically sorts the entire week's TV schedule so you can see at what time each show is on and when it's repeated.

Train Times

As with TV schedules, obtaining complete timetables for every train around the world is a thankless and impossible task. But fortunately, like TV, the rail journeys of interest are generally based in one country, so you need to find only one suitable data source for your area. Any search engine will return several different data sources for this information, depending on your country, so you need to spend a little time looking at each one to determine which have usable APIs or, failing that, which can be scraped with the least amount of effort.

Most people who use trains regularly have a routine, where they know the timetables, so the web sites of most interest are those that report live information, including late trains and cancellations. In England, the foremost site is Live Departure Boards (www.nationalrail.co.uk/times_fares/ldb/), which provides reasonably accurate information about most trains on the U.K. network. It doesn't include an API, unfortunately, but it is very easy to scrape for the current train times, and it also comes with a Twitter feed detailing station closures and overrunning engineering works. It also has the advantage of using a basic GET request to retrieve the times of all the trains between two named stations, making it easier to bookmark. One journey I make on occasion is between St. Pancras and Luton Airport.

On reviewing the site, I can see that the URL incorporates both of these locations (?T=STP&S=LTN), which can be dropped into the code4 for whattrain.pl like this:

my $agent = WWW::Mechanize->new();
$agent->get("http://realtime.nationalrail.co.uk/ldb/sumdep.aspx?T=STP&S=LTN");

Looking at the HTML source, there is a suspiciously unique row2 class identifier for each of the rows detailing the trains. So, you need a simple loop to check for this:

while ($tag = $stream->get_tag("tr")) {
    if (defined $tag->[1]{class} && $tag->[1]{class} eq "row2") {

which in turn allows you to read the information for each train:

        $stream->get_tag("td");
        my $destination = $stream->get_trimmed_text("/td");

        $stream->get_tag("td");
        my $platform = $stream->get_trimmed_text("/td");

        $stream->get_tag("td");
        my $expectedTime = $stream->get_trimmed_text("/td");

        $stream->get_tag("td");
        my $arrivalTime = $stream->get_trimmed_text("/td");

This web site has been used, in part, because enough information is present to make automatic value judgments when it comes to catching the train. For example, knowing that it takes 35 minutes to travel from work to St. Pancras, you can exclude any train leaving within that window. Furthermore, by adding a grace period, you can limit the output to trains that will leave within ten minutes of you arriving at the station:

my $graceMinutes = $minutesAway - $timeToStation;
if ($graceMinutes >= $graceThreshold && $graceMinutes < $maxGracePeriod) {
    print "Get the $expectedTime to $destination from platform $platform.";
    print "There is $graceMinutes minutes grace.\n";
}

This code can naturally be extended to swap the source and destination locations so that the return journey can be similarly considered.
This could happen by looking at the time of day, for example.

4 This is screen-scraping Perl code, which may have broken by the time you read this. The relative pitfalls and considerations of this approach were covered earlier in the chapter.
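As a minimal sketch of that extension, the station codes are the STP/LTN pair used above, while the midday cut-off and which direction counts as "morning" are assumptions to adapt to your own routine:

#!/usr/bin/perl -w
# Choose the journey direction by the time of day, as suggested above.
# The 12:00 cut-off is an arbitrary assumption; which code is origin and
# which is destination follows the URL used earlier, and only the swap
# matters here.
use strict;
use WWW::Mechanize;

my $hour = (localtime())[2];

# Before midday, request one direction; afterward, swap the codes so the
# return journey is queried instead.
my ($t, $s) = $hour < 12 ? ("STP", "LTN") : ("LTN", "STP");

my $agent = WWW::Mechanize->new();
$agent->get("http://realtime.nationalrail.co.uk/ldb/sumdep.aspx?T=$t&S=$s");

# The same row2 parsing loop shown earlier can then be applied, unchanged.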