Chapter 7: The Internet

using a loop to iterate through each element of $data, the much faster and more efficient array_map() function is called. It does the same job but needs only the name of a function to call for each element. To populate the $left array, which holds the left half of each line, the function PIPHP_PU_F1() is called. For the $right array, PIPHP_PU_F2() is called. The split is needed because the URL and checksum are stored side by side on each line, separated only by the token !1!, which is unlikely to appear in any URL.

A for loop then iterates through the $left array to check whether $page already exists in the datafile. If so, $exists is set to the index of the array element where it is located. The matching element in $right is compared with the value of $checksum and, if they are the same, zero is returned to indicate that the page is unchanged since the last time the program checked.

If, on the other hand, $page exists in the datafile but $checksum does not match the saved value, then the page contents must have changed. In this case, the old checksum value in the datafile is overwritten with the new value in $checksum using the str_replace() function, the datafile is saved back to disk, and a value of 1 is returned to indicate that the web page has changed.

At the end of the if (file_exists($datafile)) set of statements, if the file does not already exist, the string $rawfile is assigned the empty string. Finally, whether or not the file exists, the contents of $rawfile are saved to disk, followed by the values of $page and $checksum separated by the token !1! and terminated by a \n newline character. This either creates the datafile if it doesn’t exist or, if it does, appends a new line of data to it.
Either way, a value of -1 is returned to indicate that the URL in $page was new to the datafile and has now been saved. Note that the two functions PIPHP_PU_F1() and PIPHP_PU_F2() are for the exclusive use of the main plug-in and are not intended to be called elsewhere.

How to Use It

To use this plug-in, call it like this:

    $page     = "http://pluginphp.com";
    $datafile = "urldata.txt";
    $result   = PIPHP_PageUpdated($page, $datafile);

Then, to act on the value in $result, you might use code such as this:

    echo "<pre>(1st call) The URL '$page' is ";
    if     ($result === -1) echo "New";
    elseif ($result === 1)  echo "Changed";
    elseif ($result === 0)  echo "Unchanged";
    else                    echo "Inaccessible";

The strict === operator is used because the plug-in returns FALSE for an inaccessible page, and in PHP the loose comparison FALSE == 0 is true, so a failed fetch would otherwise be reported as unchanged.

This will tell you (or your users) whether the index page at www.pluginphp.com has changed since the last time it was checked, or whether it is new to the datafile or even inaccessible.

Plug-in PHP: 100 Power Solutions

The first time you make the call regarding a new page, it will always report that the page is new. If you make a second call immediately afterward (such as with the following code) on a page that is not dynamically generated, you will be told that the page is unchanged; otherwise, you’ll be told it has changed:

    $result = PIPHP_PageUpdated($page, $datafile);
    echo "<br />(2nd call) The URL '$page' is ";
    if     ($result === -1) echo "New";
    elseif ($result === 1)  echo "Changed";
    elseif ($result === 0)  echo "Unchanged";
    else                    echo "Inaccessible";

You might prefer to send an e-mail instead of displaying this information in a browser, in which case just replace the echo statements with a call to plug-in 38, PIPHP_SendEmail(), passing the text of the echo statements in the $message argument.
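Because the plug-in can return FALSE as well as the integers -1, 0, and 1, it can help to wrap the interpretation of $result in one place. The following is only a minimal sketch: the helper name describe_page_result() is hypothetical and not part of the plug-in, and the "Unknown" fallback is an assumption added for completeness.

```php
<?php
// Hypothetical helper (not part of the book's plug-in): turn the return
// value of PIPHP_PageUpdated() into a human-readable label.
function describe_page_result($result)
{
    // Test for FALSE first and compare with === throughout. With the loose
    // == operator, FALSE == 0 is true, so a page that could not be fetched
    // would be mislabeled as "Unchanged".
    if ($result === FALSE) return "Inaccessible";
    if ($result === -1)    return "New";
    if ($result === 1)     return "Changed";
    if ($result === 0)     return "Unchanged";

    // Assumed fallback for any value the plug-in is not documented to return.
    return "Unknown";
}
```

A call such as echo describe_page_result($result); could then stand in for the chain of if and elseif statements, and the same string could be passed on as the body of an e-mail.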
The Plug-in

    function PIPHP_PageUpdated($page, $datafile)
    {
       // Fetch the page; give up if it cannot be read (or is empty)
       $contents = @file_get_contents($page);
       if (!$contents) return FALSE;
       $checksum = md5($contents);

       if (file_exists($datafile))
       {
          // Each line of the datafile has the form URL!1!checksum
          $rawfile = file_get_contents($datafile);
          $data    = explode("\n", rtrim($rawfile));
          $left    = array_map("PIPHP_PU_F1", $data);
          $right   = array_map("PIPHP_PU_F2", $data);

          $exists = -1;

          for ($j = 0 ; $j < count($left) ; ++$j)
          {
             if ($left[$j] == $page)
             {
                $exists = $j;
                if ($right[$j] == $checksum) return 0;
             }
          }

          if ($exists > -1)
          {
             // Replace only this page's entry, so that another URL that
             // happens to share the old checksum is left untouched
             $rawfile = str_replace("$page!1!" . $right[$exists],
                "$page!1!$checksum", $rawfile);
             file_put_contents($datafile, $rawfile);
             return 1;
          }
       }
       else $rawfile = "";

       // New URL: append it to the (possibly empty) datafile
       file_put_contents($datafile, $rawfile . "$page!1!$checksum\n");
       return -1;
    }

    // Returns the URL half of a datafile line
    function PIPHP_PU_F1($s)
    {
       list($a, $b) = explode("!1!", $s);
       return $a;
    }

    // Returns the checksum half of a datafile line
    function PIPHP_PU_F2($s)
    {
       list($a, $b) = explode("!1!", $s);
       return $b;
    }

Plug-in 48: HTML To RSS

The popularity of RSS (Really Simple Syndication) feeds is still growing due to the ease with which you can subscribe to a feed and have updates automatically sent to the feed reader. In fact, most decent browsers also offer RSS reading facilities. But what if you’re too busy developing the HTML portion of your site to start building RSS feeds? Or what if you’d like to be able to view other web sites in RSS?

The solution comes with this plug-in, which will fetch a web page, analyze it, strip out non-essential and formatting items, and reformat it into RSS (see Figure 7-8 for an example).

FIGURE 7-8 The plug-in is used to output the McGraw-Hill web site as an RSS feed.

About the Plug-in

This plug-in accepts a string containing the HTML to be converted, along with other required arguments, and returns a properly formatted RSS document.
It takes these arguments:

• $html The HTML to convert
• $title The RSS feed title to use
• $description The RSS description to use
• $url The URL to which the feed should link
• $webmaster The e-mail address of the responsible webmaster
• $copyright The copyright details

Variables, Arrays, and Functions

$date                 String containing the date in RSS-compatible form
$dom                  Document object of $html
$xpath                XPath object for traversing $dom
$hrefs                Object containing all a href= link elements in $dom
$links                Array of all the links discovered in $html
$to                   Array containing the version of what each $link should be changed to in order to ensure it is absolute
$count                Integer containing the number of elements in $to
$j                    Integer counter for iterating through $hrefs and $to
$link                 Each link in turn extracted from $links
$temp                 Non-tokenized copy of $link
PIPHP_RelToAbsURL()   Plug-in 21: This function converts a relative URL to absolute.

How It Works

This plug-in starts by setting the string variable $date to the current date and time, in a format that is acceptable to RSS readers. Then all instances of &amp; (the XML and XHTML required form of the & symbol) are converted to just the & symbol, and then all & symbols are changed to a special token with the value !!**1**!!. As described in plug-in 46, this is done because the str_replace() function seems to have a bug relating to the use of the & symbol, so the token is substituted to avoid it. The & symbols will be swapped back later.

After that, the code has much in common with many of the other plug-ins in this chapter in that it must traverse an HTML DOM (Document Object Model), ensuring all a href= links are in absolute format. It does this by creating a new DOM object in $dom and then loading it with the HTML tags from $html. Then a new XPath object is created in $xpath, and $xpath->evaluate() is used to extract all the a href= tags into the $hrefs object.

Next the arrays $links and $to are initialized.
These will respectively contain all the encountered links and the absolute forms to which they should be changed. A counter that will index into these arrays, $count, is also initialized.

A for loop is then used to extract the links from each a href= tag into the array $links, which then has all duplicates removed using the array_unique() function. Because array_unique() keeps the original keys, removing duplicates can leave gaps in the numbering, so the array is then sorted so that all elements are stored contiguously.

A foreach loop is then used to iterate through each link, first checking that the link actually has been assigned a value. If it has, the string variable $temp is assigned a version of $link without any !!**1**!! tokens that may have replaced any & symbols. This ensures a properly formed URL is ready for converting to absolute format using the PIPHP_RelToAbsURL() function, and for assigning to an element in the $to array.

Again, as in plug-in 46, tokens are substituted for all links within the main document to prevent potential clashes during multiple replace operations. Every allowable form of link is substituted, whether double-quoted, single-quoted, or unquoted: href="link", href='link', and href=link. The tokens take the form !!$count!! and therefore start at !!0!! and proceed through !!1!! and so on each time a new link is substituted.

Once all the tokens are in place in the document, and there is no chance of clashes during string substitutions, a for loop is used to convert them into the absolute URLs held in the $to array.

Next, any encoded URLs in which http:// has been turned into http%3A%2F%2F are restored to http://, any & symbols are restored from the token !!**1**!!, and all whitespace in the document is collapsed using the preg_replace() function with a pattern of /[\s]+/, which replaces every run of one or more consecutive whitespace characters with a single space.
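The link-handling steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the plug-in's own code: sketch_make_absolute() is a hypothetical and deliberately naive stand-in for PIPHP_RelToAbsURL(), the function names and the $base parameter are invented for the example, the & token handling is left out, and only the two quoted href forms are covered.

```php
<?php
// Hypothetical, naive stand-in for PIPHP_RelToAbsURL() (plug-in 21):
// links that already have a scheme are kept, anything else is prefixed.
function sketch_make_absolute($base, $link)
{
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $link)) return $link;
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

// Hypothetical sketch of the extract / tokenize / replace sequence.
function sketch_absolutize_links($html, $base)
{
    // Load the HTML and select every <a> element that has an href.
    $dom = new DOMDocument();
    @$dom->loadHTML($html);          // @ hides warnings about sloppy markup
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("//a[@href]");

    $links = array();
    for ($j = 0; $j < $hrefs->length; ++$j)
        $links[] = $hrefs->item($j)->getAttribute('href');

    // array_unique() keeps the original keys; sort() renumbers from 0.
    $links = array_unique($links);
    sort($links);

    // Pass 1: swap each quoted link for a numbered token such as !!0!!.
    $to    = array();
    $count = 0;
    foreach ($links as $link)
    {
        if ($link === "") continue;
        $to[$count] = sketch_make_absolute($base, $link);
        $html = str_replace("href=\"$link\"", "href=\"!!$count!!\"", $html);
        $html = str_replace("href='$link'",   "href='!!$count!!'",   $html);
        ++$count;
    }

    // Pass 2: no raw links remain to clash, so swap tokens for final URLs.
    for ($j = 0; $j < $count; ++$j)
        $html = str_replace("!!$j!!", $to[$j], $html);

    // Collapse every run of whitespace into a single space.
    return preg_replace('/[\s]+/', ' ', $html);
}
```

The two passes are the point of the sketch: replacing links directly with their absolute forms risks a later replacement matching text that an earlier one has just inserted, whereas tokens cannot be mistaken for links.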
The next lines strip out any <script> and <style> tags and their contents, followed by ensuring that all <h> tags have their contents stripped out (that is, the contents of the tags themselves, not the heading text between them). This is done so a conversion can easily be made later into RSS headers.

With those tags removed, all remaining tags are also stripped out, with the exception of those listed in the string $ok. This process is handled by the function strip_tags(). In case you’re wondering, I tried to also remove the <script> and <style> tags using strip_tags(), but the function seems buggy and would not always remove them, so that’s why these are handled separately.

After that, all remaining HTML characters are replaced with their RSS equivalents; so, for example, the < symbol becomes &lt;, the > becomes &gt;, and so on.

The final two preg_replace() calls substitute the opening and closing forms of the <h> tag, which previously had any contents stripped out, with the XML required for properly formatted RSS headers. In other words, this plug-in assumes that anything between <h> and </h> tags should be treated as RSS headers.

Finally, the RSS itself is returned within a return <<<_END ... _END construct, where you can see $title, $url, $description, and all the other variables in their correct places, all the way down to $html, the main contents of the feed on which this plug-in has performed all the processing.

How to Use It

When you want to convert HTML to RSS, you can use code such as the following, in which your web site domain is assumed to be myserver.com:

    $html  = "Your HTML content goes here";
    $title = "RSS version of my webpage";