Plug-in PHP: 100 Power Solutions

Now that $text holds the raw article, it’s time to convert Wikipedia’s special markup into the text and basic HTML this plug-in supports. Before writing this plug-in, I spent hours searching for existing code that already did the job. There were a few examples, but they were all long-winded and seemed overly complicated, which is why I chose to write my own code. In the end, it turned out that fewer than a couple of dozen rules were enough to make sense of most of Wikipedia’s markup.

For example, you’ve already seen how ==Heading== stands for <h2>Heading</h2>. Similarly, ===Subheading=== stands for <h3>Subheading</h3>, and so on. Likewise, '''word''' (three single quotes on either side of some text) stands for <b>word</b>, and ''word'' (two single quotes on either side of some text) stands for <i>word</i>.

Items in ordered and unordered lists are indicated by starting a new line with a # or a * symbol, so for simplicity I chose to convert both into the HTML bullet entity, ●, and to treat nested lists as if they were on the same level. Tables begin on a new line with a { symbol, so the code ignores everything from \n{ up to a closing } symbol. Double newlines, \n\n, are converted into <p> tags.

There’s also some more complicated markup such as [[Article]], meaning “Place a hyperlink here to Wikipedia’s article entitled Article,” or [[Article|Look at this]], which means “Add a hyperlink to Wikipedia’s article entitled Article here, but display the hyperlink text Look at this.” A few more variations on this theme exist, and there are several types of markup I chose to ignore: [[Image…]], [[File…]], and [[Category…]], which contain material supplemental to the main text, and [http…], which contains hyperlinks I didn’t want to use.
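The order in which these conversions run matters, and a short sketch may make that concrete. The following is only an illustration using a made-up sample string (the variable names $wiki and $html are assumptions, not part of the plug-in): the three-quote rule has to run before the two-quote rule, and the piped-link rule before the plain-link rule, otherwise the shorter pattern would consume part of the longer markup.

```php
<?php
// Hypothetical sample text; not taken from a real Wikipedia article.
$wiki = "'''Bold''' and ''italic'' with [[Climate change|a piped link]] and [[Weather]].";

// Three quotes first: if the two-quote rule ran first it would
// swallow part of the bold markers.
$html = preg_replace("/'{3}(.*?)'{3}/", '<b>$1</b>', $wiki);
$html = preg_replace("/'{2}(.*?)'{2}/", '<i>$1</i>', $html);

// Piped links first: keep only the display text after the | symbol.
$html = preg_replace('/\[{2}[^\|\]]+\|([^\]]*)\]{2}/', '$1', $html);

// Whatever double-bracket links remain are plain ones: keep the title.
$html = preg_replace('/\[{2}(.*?)\]{2}/', '$1', $html);

echo $html, "\n";
```

With this sample, the bold and italic markers become <b> and <i> tags, and both links are reduced to plain text.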
What’s more, there are also sections such as <gallery> and <ref>, which I decided should be ignored, and some major sections appearing within {{ and }} pairs of symbols, which are often nested with sub- and sub-sub-sections. Again, all of these add richer content to a standard Wikipedia article, but are not necessary when we simply want the main text.

Therefore, the $data array contains a sequence of regular expressions to be searched for, each accompanied by the string with which to replace its matches. A for loop steps through the array a pair at a time, passing each pair of strings to the preg_replace() function. If you want to learn more about the regular expressions used, there’s a lot of information at http://en.wikipedia.org/wiki/Regular_expression.

Having massaged the text into almost plain text (with the exception of <h1> through <h7> headings, and the <p>, <br>, <b>, and <i> tags), the strip_tags() function is called to remove any other tags that remain. Finally, before returning the article text, a notice and hyperlink are appended to it showing the original Wikipedia article from which the text was derived.

In all, I think you’ll find that these rules handle the vast majority of Wikipedia pages very well, although you will encounter the odd page that doesn’t come out quite right. In such cases, you should be able to spot the markup responsible and add a translation for it to the $data array.

If you use this plug-in on a production server, you’ll also need to comply with Wikipedia’s licensing requirements by adding a link to the GNU Free Documentation License, and indicating that your version of the article is also released under this license. For details, please see http://en.wikipedia.org/wiki/Wikipedia_Copyright.
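As a rough sketch of the pair-at-a-time loop described above, the following uses a cut-down array (called $rules here, a hypothetical name) holding only three illustrative rules rather than the plug-in’s full set, and a made-up input string. It then calls strip_tags() with a short list of allowed tags to show how anything else, such as the <span> in the sample, is removed.

```php
<?php
// Pattern/replacement pairs stored in one flat array.
// Only three illustrative rules; the real list is longer.
$rules = array(
    '\n={2}([^=]+)={2}',    '<h2>$1</h2>',  // ==Heading==
    '<ref>([^<]+?)<\/ref>', '',             // inline references
    '\n{2}',                '<p>',          // blank line => paragraph
);

// Hypothetical sample input.
$text = "Intro<ref>Some source</ref>\n==History==\n\nMore <span>text</span>.";

// Even indexes hold the pattern, odd indexes the replacement.
for ($j = 0; $j < count($rules); $j += 2) {
    $text = preg_replace("/{$rules[$j]}/", $rules[$j + 1], $text);
}

// Keep only the tags named in the second argument.
$text = strip_tags($text, '<h2><p>');

echo $text, "\n";
```

Adding support for another piece of markup is then a matter of appending one more pattern and replacement to the array.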
Chapter 10: APIs, RSS, and XML

How to Use It

To use this plug-in, just pass it a Wikipedia article title and display the result returned, like this:

    $result = PIPHP_FetchWikiPage('Climate Change');
    if (!$result) echo "Could not fetch article.";
    else echo $result;

Incidentally, I chose this article because it is one of those that returns the previously mentioned #REDIRECT string. In this case, Climate Change is redirected to Climate change (with a lowercase c in the second word), which shows that the code correctly handles redirects.

Because Wikipedia uses the UTF-8 character set to support all its different languages, you may also need to include the following HTML <meta> tag in the <head> section of your HTML output, to ensure that all characters display correctly:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

To avoid thrashing Wikipedia’s servers, and to cut down on the processing power required on your own, you should definitely consider saving the result from each call to this plug-in, either as a text file or, preferably, in a MySQL database, and then serving up the cached copy whenever future requests are made for the same article.

If you wish to compile your own database of Wikipedia articles using this plug-in, you can find all the various indexes at http://en.wikipedia.org/wiki/Portal:Contents.

Remember, when you use this plug-in you must also copy and paste the PIPHP_CurlGetContents() plug-in into your program, or otherwise include it, because this plug-in calls it.

The Plug-in

    function PIPHP_FetchWikiPage($entry)
    {
       $agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; ' .
                'rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR ' .
                '3.5.30729)';
       $text  = '';

       // Keep fetching while there is no text yet or the page is a redirect
       while ($text == '' || substr($text, 0, 9) == '#REDIRECT')
       {
          $entry = rawurlencode($entry);
          $url   = "http://en.wikipedia.org/wiki/Special:Export/$entry";
          $page  = PIPHP_CurlGetContents($url, $agent);
          if (!$page) return FALSE; // Fetch failed or returned nothing

          $xml   = simplexml_load_string($page);
          // Stop instead of looping forever when no article came back
          if ($xml === FALSE || !isset($xml->page)) return FALSE;

          $title = $xml->page->title;
          $text  = $xml->page->revision->text;

          if (substr($text, 0, 9) == '#REDIRECT')
          {
             // Follow the redirect using the title inside [[...]]
             preg_match('/\[\[(.+)\]\]/', $text, $matches);
             $entry = $matches[1];
          }
       }

       // Cut the text off at the first supplementary section found
       $sections = array('References', 'See also', 'External links',
          'Notes', 'Further reading');

       foreach($sections as $section)
       {
          $ptr = stripos($text, "==$section==");
          if ($ptr) $text = substr($text, 0, $ptr);
          $ptr = stripos($text, "== $section ==");
          if ($ptr) $text = substr($text, 0, $ptr);
       }

       // Pairs of regular expression and replacement string
       $data = array('\[{2}Imag(\[{2})*.*(\]{2})*\]{2}', '',
                     '\[{2}File(\[{2})*.*(\]{2})*\]{2}', '',
                     '\[{2}Cate(\[{2})*.*(\]{2})*\]{2}', '',
                     '\{{2}([^\{\}]+|(?R))*\}{2}',       '',
                     '\'{3}(.*?)\'{3}',                  '<b>$1</b>',
                     '\'{2}(.*?)\'{2}',                  '<i>$1</i>',
                     '\[{2}[^\|\]]+\|([^\]]*)\]{2}',     '$1',
                     '\[{2}(.*?)\]{2}',                  '$1',
                     '\[(http[^\]]+)\]',                 ' ',
                     '\n(\*|#)+',                        '<br /> ● ',
                     '\n:.*?\n',                         '',
                     '\n\{[^\}]+\}',                     '',
                     '\n={7}([^=]+)={7}',                '<h7>$1</h7>',
                     '\n={6}([^=]+)={6}',                '<h6>$1</h6>',
                     '\n={5}([^=]+)={5}',                '<h5>$1</h5>',
                     '\n={4}([^=]+)={4}',                '<h4>$1</h4>',
                     '\n={3}([^=]+)={3}',                '<h3>$1</h3>',
                     '\n={2}([^=]+)={2}',                '<h2>$1</h2>',
                     '\n={1}([^=]+)={1}',                '<h1>$1</h1>',
                     '\n{2}',                            '<p>',
                     '<gallery>([^<]+?)<\/gallery>',     '',
                     '<ref>([^<]+?)<\/ref>',             '',
                     '<ref [^>]+>',                      '');

       // Apply each pattern/replacement pair in turn
       for ($j = 0 ; $j < count($data) ; $j += 2)
          $text = preg_replace("/$data[$j]/", $data[$j+1], $text);

       // Remove every tag except the ones this plug-in supports
       $text = strip_tags($text, '<h1><h2><h3><h4><h5><h6><h7>' .
          '<p><br><b><i>');

       // Append a notice linking back to the source article
       $url   = "http://en.wikipedia.org/wiki/$title";
       $text .= "<p>Source: <a href='$url'>Wikipedia ($title)</a>";
       return trim($text);
    }

Fetch Flickr Stream

If you enjoy looking at photographs, chances are you have used the Flickr photo sharing service and may also have discovered a few photographers whose Flickr streams you like to follow. Well, now you can offer the same facility to your users with this plug-in. Using it you can look up any public Flickr stream and return the (up to) 20 most recent photographs from it. Figure 10-5 shows the result of pointing the plug-in at a new account I created at Flickr. In this instance, I chose to display links to the photos, but you can also embed them in your web pages if you wish.

About the Plug-in

This plug-in takes the name of a public Flickr account and returns the most recent photos. On success, it returns a two-element array: the first element is the number of photos returned, and the second is an array containing the URL of each photo. On failure, it returns a single-element array with the value FALSE. It requires this argument:

• $account A Flickr account name such as xxxxxxxx@Nxx (where the x symbols represent digits), or a more friendly Flickr username such as mine, which is robinfnixon

Variables, Arrays, and Functions

    $url    String containing the Flickr photo stream base URL
    $page   String containing the Flickr stream HTML page contents
    $rss    String containing the location of the RSS feed for $page
    $xml    String containing the contents of $rss
    $sxml   SimpleXML object created from $xml
    $pics   Array containing the image URLs
    $item   SimpleXML object extracted from item in $sxml
    $j      Integer loop variable for iterating through image URLs
    $t      String used for transforming URLs into the form required

FIGURE 10-5 With this plug-in you can view the stream of a public Flickr user.
How It Works

This plug-in takes the base Flickr stream URL and appends the account name in $account to it. The resulting HTML page is then fetched using the file_get_contents() function and its contents are stored in $page. The @ symbol prefacing the function suppresses any error messages should the call fail. If it does fail, a value of FALSE is returned in a single-element array.

Next, the array that will hold the image URLs, $pics, is initialized and the program screen scrapes the HTML page to locate the position of the RSS link within it. Screen scraping is the term given to the process of extracting information from HTML pages that hasn’t been explicitly provided to you in an API or via another method. Actually, there are Flickr APIs to do this, but these three lines of code are simpler and represent all the coding required to find the RSS feed on the page and return its URL to the variable $rss.

Using this URL, the RSS feed is fetched and placed in the string $xml, from where it is transformed into a SimpleXML object in $sxml. This is a DOM (Document Object Model) object that can be easily traversed. To do this, a foreach loop iterates through the items in $sxml->entry, placing each in a new object called $item. Then a for loop is used to iterate through all the items in $item->link, which contains the URLs we are interested in.

If $item->link[$j]['type'] has the value image, then $item->link[$j]['href'] will contain a URL, so this is extracted into the variable $t, first removing any _t or _m sequences from the URL, since they represent different sizes of the photo that we are not interested in. Once $t contains the URL wanted, its value is assigned to the next available element of the $pics array and the foreach loop continues.
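The scraping and traversal steps just described can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the plug-in’s own listing: the HTML snippet, the feed fragment, and the example.com URLs are all made up, the three string calls are only one plausible way to pull a feed URL out of a page, and the check for "image" assumes the link type is a MIME string such as image/jpeg. The feed is supplied inline rather than fetched, so the sketch runs without network access.

```php
<?php
// Hypothetical page fragment advertising a feed (made-up URL).
$page = '<head><link rel="alternate" type="application/rss+xml" ' .
        'href="http://example.com/feed.xml"></head>';

// One way to scrape the feed URL with three string calls:
$rss = strstr($page, 'rss+xml');            // jump to the feed link
$rss = strstr($rss, 'http://');             // jump to the start of its URL
$rss = substr($rss, 0, strpos($rss, '"'));  // cut at the closing quote

// Hypothetical, hand-written feed; a real feed carries many more fields.
// In practice $xml would be fetched from the URL now held in $rss.
$xml = <<<XML
<feed>
  <entry>
    <link type="text/html" href="http://example.com/photos/1/"/>
    <link type="image/jpeg" href="http://example.com/img/111_aaa_m.jpg"/>
  </entry>
  <entry>
    <link type="image/jpeg" href="http://example.com/img/222_bbb_t.jpg"/>
  </entry>
</feed>
XML;

$sxml = simplexml_load_string($xml);
$pics = array();

foreach ($sxml->entry as $item) {
    for ($j = 0; $j < count($item->link); ++$j) {
        // Assumption: image links have a type containing "image"
        if (strpos((string) $item->link[$j]['type'], 'image') !== false) {
            // Drop the _t and _m size suffixes from the URL
            $t = str_replace(array('_t', '_m'), '',
                             (string) $item->link[$j]['href']);
            $pics[] = $t;
        }
    }
}

print_r(array(count($pics), $pics));
```

Only the two links typed as images end up in $pics, each with its size suffix removed; the text/html link is skipped.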
The plug-in returns a two-element array with the first element containing the number of photos found, calculated using the count() function, and the second containing an array of the photo URLs.

Figure 10-6 shows a photo taken at random from the list returned and entered into a browser. In this case, it has the following Flickr URL:

    http://farm3.static.flickr.com/2522/3708788611_5a9964f24d_o.jpg

FIGURE 10-6 The plug-in determines the exact URL required for each photo.