Plug-in PHP: 100 Power Solutions - P56

Curl Get Contents

Some web sites don't like being accessed by anything other than a web browser, which can make it difficult to fetch data from them with a PHP program using a function such as file_get_contents(). Such sites generally block your program by checking for a User Agent string, which is something all browsers send to the web sites they visit, and which can vary widely. User Agent strings look something like this:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.1)
Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)

Therefore, to access these sites it is necessary to simulate being a browser, which, as shown in Figure 10-2, this plug-in will do for you.

About the Plug-in

This plug-in is intended to replace the PHP file_get_contents() function when used to fetch a web page. It accepts the URL of a page and a browser User Agent to emulate, and on success it returns the contents of the page at the given URL. On failure, it returns FALSE. It requires these arguments:

• $url    The URL to fetch
• $agent  The User Agent string of a browser

FIGURE 10-2  This plug-in is used to fetch and display the www.pluginphp.com home page.

Variables, Arrays, and Functions

$ch       CURL handle to an opened curl_init() session
$result   The returned result from the curl_exec() call

How It Works

This plug-in uses the CURL (Client URL) library extension to PHP. If it fails, you need to read your server and/or PHP installation instructions, or consult your server administrator about enabling the CURL extension. The plug-in opens a session with curl_init(), passing a handle for the session to $ch. Such a session can perform a wide range of URL-related tasks, but first the plug-in uses curl_setopt() to set up the various options required prior to making the curl_exec() call. These include setting CURLOPT_URL to the value of $url and CURLOPT_USERAGENT to the value of $agent; a number of other options are also set to sensible values. The curl_exec() function is then called, with the result of the call being placed in $result. The session is then closed with a call to curl_close(), and the value in $result is returned.

How to Use It

Using this plug-in is as easy as replacing calls to file_get_contents() with PIPHP_CurlGetContents(). As long as you have also passed a sensible-looking User Agent string, the plug-in will be able to return some pages that could not be retrieved with the former function call. For example, you can load in and display the contents of a web page like this:

$agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; ' .
         'rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR ' .
         '3.5.30729)';
$url   = 'http://pluginphp.com';

echo PIPHP_CurlGetContents($url, $agent);

This will display the main page of the www.pluginphp.com web site, which should look like Figure 10-2. There's a comprehensive explanation (and collection) of User Agent strings at www.useragentstring.com.

CAUTION  Sometimes the reason a web site only allows browser access to a page is that other programs are not permitted to access it. So please check how you are allowed to access information from such a web site, and what you are allowed to do with it, before using this plug-in.
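Because the listing that follows sets CURLOPT_FAILONERROR, an HTTP error response also results in a FALSE return, so it can be worth testing the result before using it. A minimal sketch, reusing the $agent and $url values from the previous example:

$result = PIPHP_CurlGetContents($url, $agent);

// Compare with !== so that an empty (but successfully
// fetched) page is not mistaken for the FALSE returned
// on failure
if ($result !== FALSE) echo $result;
else echo "Unable to fetch $url";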
The Plug-in

function PIPHP_CurlGetContents($url, $agent)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_USERAGENT,      $agent);
    curl_setopt($ch, CURLOPT_HEADER,         0);
    curl_setopt($ch, CURLOPT_ENCODING,       "gzip");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_FAILONERROR,    1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 8);
    curl_setopt($ch, CURLOPT_TIMEOUT,        8);

    $result = curl_exec($ch);
    curl_close($ch);

    return $result;
}

Fetch Wiki Page

Wikipedia is an excellent resource with several million articles. Even allowing that some of the information may not always be correct, because any user can edit a page, on the whole most of the web site is factual, and it contains a summary of almost the whole depth and breadth of human knowledge. What's even better is that Wikipedia is published under the GNU Free Documentation License (see www.gnu.org/copyleft/fdl.html). Essentially this means that you can use any text from it as long as you give full attribution of the source, and also offer the text (with any amendments) under the same license. As a consequence, I now have the entire Wikipedia database stored in my iPhone so that I can instantly look up any entry, even when mobile connectivity is limited. By using data compression techniques, and keeping only the main article text, it takes up just 2GB of space.

The GFDL license also means you can use programs such as this plug-in to reformat and reuse articles from Wikipedia, as shown in Figure 10-3, in which just the text has been extracted from its article on PHP.

FIGURE 10-3  Using this plug-in, you can extract just the text from a Wikipedia entry.

If you also take a look at Figure 10-4, you'll see the original article at Wikipedia and, comparing the two, you'll notice that the plug-in has completely ignored all the formatting, graphics, tables, and other extras, leaving behind just the text of the article. Using it you could create your own reduced-size local copy of Wikipedia, or perhaps use it to add hyperlinks to words or terms you wish to explain to your readers. I have used this code to add short encyclopedia entries to searches returned by a customized Google search engine I wrote. Combined with other plug-ins from this book, you could reformat articles into RSS feeds, translate them into "friendly" text, or, well, once you have access to the Wikipedia text, it's really only up to your imagination what you choose to do with it.

About the Plug-in

This plug-in takes the title of a Wikipedia entry and returns just the text of the article, or FALSE on failure.
It requires this argument:

• $entry  A Wikipedia article title

FIGURE 10-4  The original article about PHP on the Wikipedia web site

Variables, Arrays, and Functions

$agent     String containing a browser User Agent string
$url       String containing the URL of Wikipedia's XML export API
$page      String containing the result of fetching the Wikipedia entry
$xml       SimpleXML object created from $page
$title     String containing the article title as returned by Wikipedia
$text      String containing the article text
$sections  Array of the section headings at which to truncate the text
$section   String containing each element of $sections in turn
$ptr       Integer offset into $text indicating the start of $section
$data      Array of search and replace strings for converting raw Wikipedia data
$j         Integer loop counter for processing search and replace actions
$url       String containing the URL of the original Wikipedia article

How It Works

Wikipedia has kindly created an API with which you can export selected articles from its database. You can access it at:

http://en.wikipedia.org/wiki/Special:Export

Unfortunately, this API is set to deny access to programs that do not present a browser User Agent string. Luckily, the previous plug-in provides just that functionality, so using it along with this plug-in, it's possible to export any Wikipedia page as XML, which can then be reduced to just the raw text.

This is done by setting up a browser User Agent string and then calling the Export API using PIPHP_CurlGetContents(), passing the Export API URL along with the article title and the browser agent. Before making the call, though, $entry is passed through the rawurlencode() function to convert non-URL-compatible characters into acceptable equivalents, such as spaces into %20 codes.

The XML page returned from this call is then parsed into an XML object using the simplexml_load_string() function, the result being placed in $xml. Next, the only two items of information that are required, the article title and its text, are extracted from $xml->page->title and $xml->page->text, into $title and $text.

Notice that all of this occurs inside a while loop. This is because by far the majority of Wikipedia page titles are redirects from misspellings or different capitalizations. What the loop does is look for the string #REDIRECT in a response and, if one is discovered, the loop goes around again using the redirected article title, which is placed in $entry by using preg_match() to extract it from between a pair of double square brackets. The loop can handle multiple redirects, which are not as infrequent as you might think, given the age of Wikipedia and the number of times many articles have been moved by now.

So, with the raw Wikipedia text now loaded into $text, the next section of code truncates the string at whichever of the five headings References, See Also, External Links, Notes, or Further Reading (if any) appears first, because those sections are not part of the main article and are to be ignored. This is done by using a foreach loop to iterate through the headings, which are enclosed by pairs of = symbols, Wikipedia's markup to indicate an <h2> heading. Because some Wikipedia authors put spaces inside the ==, both cases (with and without spaces) are tested.
Each heading in turn is searched for using the stripos() function and, if a heading is found in $text, $ptr will point to its start, so $text is then truncated to end at that position.
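The plug-in's full listing is not included in this excerpt, so here is a minimal sketch of the logic just described, for orientation only. The function name PIPHP_FetchWikiPage is assumed from the section title, the final search-and-replace cleanup using $data and $j is omitted, and note that in the current Special:Export XML the article text is nested inside a <revision> element:

// A sketch only, not the book's listing: reconstructs the
// fetch / redirect / truncate steps described above
function PIPHP_FetchWikiPage($entry)
{
    // Any sensible-looking browser User Agent string will do
    $agent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; ' .
             'en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5';
    $url   = 'http://en.wikipedia.org/wiki/Special:Export/';

    while (TRUE)
    {
        // Fetch the entry as XML via the previous plug-in
        $page = PIPHP_CurlGetContents($url .
                rawurlencode($entry), $agent);
        if ($page === FALSE) return FALSE;

        $xml = simplexml_load_string($page);
        if (!$xml) return FALSE;

        $title = (string) $xml->page->title;
        $text  = (string) $xml->page->revision->text;

        // No redirect? Then we have the article text
        if (stripos($text, '#REDIRECT') === FALSE) break;

        // Otherwise extract the redirect target from between
        // the double square brackets and loop around again
        if (!preg_match('/\[\[(.*?)\]\]/', $text, $matches))
            return FALSE;
        $entry = $matches[1];
    }

    // Truncate at the first non-article section heading,
    // testing both the ==Heading== and == Heading == forms
    $sections = array('References', 'See also',
                      'External links', 'Notes', 'Further reading');

    foreach ($sections as $section)
    {
        foreach (array("==$section==", "== $section ==") as $h)
        {
            $ptr = stripos($text, $h);
            if ($ptr !== FALSE) $text = substr($text, 0, $ptr);
        }
    }

    return $text;
}

With this sketch, a call such as echo PIPHP_FetchWikiPage('PHP'); would print the raw wiki markup of the PHP article; the book's actual plug-in additionally converts that markup into plain text using the $data search-and-replace pairs.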
