Chapter 3: Text Processing

How to Use It

To use this plug-in, pass it some text to truncate, the maximum number of allowed characters, and a symbol or string to attach to the end of the truncated string, like this:

   echo PIPHP_TextTruncate($text, 90, " ");

You can choose any character or string for $symbol (or even the empty string). A useful choice is the HTML entity &hellip;, which displays an ellipsis made up of three periods, the standard notation to indicate that some text is missing.

The Plug-in

   function PIPHP_TextTruncate($text, $max, $symbol)
   {
      // Cut the text to the maximum length
      $temp = substr($text, 0, $max);

      // Back up to the last space so no word is split
      $last = strrpos($temp, " ");
      $temp = substr($temp, 0, $last);

      // Drop a trailing non-word character such as a comma
      $temp = preg_replace("/([^\w])$/", "", $temp);

      return "$temp$symbol";
   }

Spell Check

There's a spell checking module available for PHP called pspell, but if it's not already installed on your server, it needs downloading, installing, and configuring before you can use it. If you want your code to work on any server, this plug-in provides a reasonably fast spell checker based on a dictionary of over 80,000 words, which is supplied on the companion web site for this book (http://pluginphp.com) along with the plug-ins.

Figure 3-8 again shows a paragraph from Dickens' A Tale of Two Cities, but this time some deliberate spelling errors have been introduced, which have been caught by the plug-in.

FIGURE 3-8 Checking user input for spelling is easily accomplished with this plug-in.

Plug-in PHP: 100 Power Solutions

About the Plug-in

This plug-in takes a string variable containing text to spell check, along with a variable to determine how the resulting text should be displayed.
It requires these arguments:

• $text A string variable containing the text to be spell checked
• $action A string variable containing a single-letter text formatting tag, such as u, i, or b

Variables, Arrays, and Functions

$filename                          String variable containing the path and name of the dictionary file to load
$dictionary                        Array containing all the dictionary words
$newtext                           String variable containing the transformed text
$matches                           Array containing the results from the preg_match() calls
$offset                            Numeric variable pointer to the next word to check
$word                              String variable containing the current word
PIPHP_SpellCheckLoadDictionary()   Function to load in the dictionary
PIPHP_SpellCheckWord()             Function to check a single word
$top, $bot, $p                     Temporary variables used by PIPHP_SpellCheckWord() to perform a binary search of the dictionary

How It Works

With this plug-in you get two for the price of one, because the main function, PIPHP_SpellCheck(), relies on another function, PIPHP_SpellCheckWord(), to check individual words, and you can call PIPHP_SpellCheckWord() on its own, too.

The very first thing the main function does is load the dictionary file into the array $dictionary. This file is on the web site and will be downloaded along with the plug-in. It comprises over 80,000 words separated by \r\n (carriage return and line feed) pairs. If you have your own collection of words, you can use it instead, as long as there is a \r\n pair between each word. This is also why you are provided with the function PIPHP_SpellCheckLoadDictionary(), so you can specify the path and filename for such a file.

Once the dictionary has been loaded into an array using the explode() function, a space character is appended to $text. This guarantees a non-word character at the end of the string, so that a match can be made on the final word. Then, the two variables $newtext and $offset are initialized.
Respectively, they contain the transformed text and a pointer to the next word to be checked in the string $text.

The heart of the system is a while loop, which iterates through each word in $text until it reaches the end of the string. The loop recognizes the end by checking whether $offset is still less than the length of $text.

Within the loop, each word is extracted in turn using the preg_match() function with a three-part regular expression:

1. [^\w]* This looks for zero or more non-word characters, followed by…
2. ([\w]+) … one or more word characters (a-z, A-Z, 0-9, or the underscore), followed by…
3. [^\w]+ … one or more non-word characters

In part 2, the regular expression segment is surrounded by parentheses, which means that the text it matches is captured. Because preg_match() is called with the PREG_OFFSET_CAPTURE flag, the captured word is saved in the array element $matches[1][0], and its starting offset within $text in $matches[1][1]. The whole matched string, comprising all three parts, is saved in $matches[0][0], and its starting offset in $matches[0][1].

Provided with these values, the string variable $word is assigned just the part 2 match, which is the word to be spell checked. Then $offset, the pointer to the next word to be checked, is set to the starting offset of the full match plus the length of the full matched string, so as to jump over any non-word characters. The code is then ready to process the following word the next time round the loop.

In the meantime, the newly extracted word is passed to the function PIPHP_SpellCheckWord(), along with the dictionary array to use, in $dictionary. The return value from this function is TRUE if the word is found or FALSE if it isn't. Depending on the value returned, the word is added to $newtext either with or without highlighting tags.
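The word-extraction loop described above can be tried in isolation. The following is a minimal sketch, not the plug-in itself: the sample sentence is invented for illustration, and the early exit when preg_match() finds nothing is an added safeguard that the description above does not mention.

```php
<?php
// Illustrative only: the sample text is made up for this sketch.
// It ends with a space, as the plug-in appends one to $text.
$text   = "It was teh best of times ";
$offset = 0;
$words  = [];

while ($offset < strlen($text)) {
    // Stop if no further word can be matched (for example, when only
    // punctuation remains); this guard is an assumption of the sketch.
    if (!preg_match('/[^\w]*([\w]+)[^\w]+/', $text, $matches,
                    PREG_OFFSET_CAPTURE, $offset)) {
        break;
    }

    // $matches[1][0] is the captured word, $matches[1][1] its offset.
    $words[] = [$matches[1][0], $matches[1][1]];

    // Jump past the whole match: its starting offset plus its length.
    $offset = $matches[0][1] + strlen($matches[0][0]);
}

foreach ($words as [$word, $position]) {
    echo "$word starts at offset $position\n";
}
```

Each pass consumes one word together with the non-word characters around it, which is why $offset always lands on the start of the next candidate word.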
Once execution exits from the loop, the text has been fully checked, so $newtext is returned after being passed through the rtrim() function to remove the trailing space left after the final word.

The function PIPHP_SpellCheckLoadDictionary() is next. It simply loads in the specified text file, explodes it into an array by splitting it at all the \r\n pairs, and then returns the new array.

Finally, there's the function PIPHP_SpellCheckWord(). This takes the arguments $word and $dictionary and returns either TRUE or FALSE depending on whether or not the word is in the dictionary. This is done by means of a binary search in which the $dictionary array is repeatedly bisected until the word is found, or is found to be missing.

In a dictionary of 80,000 words or so, it takes at most 17 iterations to drill down to where a word is (or should be), because each iteration halves the remaining range and 2 to the power of 17 (131,072) exceeds 80,000. That is several orders of magnitude fewer comparisons than checking every word in the dictionary in turn. By the way, this search relies on having a fully sorted list of words, so if you use your own word list, make sure you sort it alphabetically first. Because the function converts each word to lowercase with strtolower() before searching, the entries should also be in lowercase.

The way the plug-in performs the binary search is to ask, "Is the word I'm looking for in the top half or bottom half of this section of words?" Then the loop goes around again, splitting whichever half it determines the word to be in and asking the same question. This continues until the word is either found or determined not to be in the dictionary.

The variables that control this divide-and-conquer method are $bot and $top, which represent the start and end positions to search between within the $dictionary array. Initially they are set to the first and last elements. On each pass, the midway point between the two values is calculated and assigned to a pivot variable called $p, and then either $bot is moved up or $top is moved down. If the word is greater than the one at position $p, then $bot is moved up past that word.
If the word is lower than the one at position $p, then $top is dropped below that position.

If at any point the word at position $p in the $dictionary array is the same as $word, then a match has been found and the value TRUE is returned. Otherwise, the process continues and eventually $top and $bot will pass each other, leaving $bot with a value higher than $top. At that point there is nowhere left to search, so the loop exits and the value FALSE is returned because no match was made.

How to Use It

To use the main function and have any misspelled words highlighted with underlines, you call it like this:

   echo PIPHP_SpellCheck($text, "u");

This will check the words in $text against all the dictionary words and highlight any that are not recognized. You can replace the "u" with "i" or "b" for italic or bold if you prefer.

If you wish to spell check a single word, perhaps to support interactive spell checking, you must load the dictionary before calling the PIPHP_SpellCheckWord() function. Ideally, place the call that does this somewhere at the start of your PHP file so you know for sure the dictionary has been loaded when you make a call. To load a dictionary file, use a command such as this:

   $dictionary = PIPHP_SpellCheckLoadDictionary("dictionary.txt");

Make sure you provide the correct file and pathname. If you are using the supplied plug-in from the web site, then dictionary.txt will be in the same directory as the plug-in.

Then, to spell check an individual word, call the function like this:

   $result = PIPHP_SpellCheckWord($word, $dictionary);

It will return TRUE if the word is recognized, or FALSE if it isn't.
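The single-word check can also be explored without the dictionary file. The following is a minimal sketch under stated assumptions: sketch_binary_search() is a hypothetical helper written for this illustration (it is not part of the plug-in), the eight-word list is invented, and strcmp() is used for the comparison in place of the < and > operators so that the ordering is always byte-wise. It mirrors the $bot, $top, and $p logic described above and also counts iterations.

```php
<?php
// Hypothetical helper for illustration; not part of the plug-in.
// Returns [found, iterations] for a binary search of a sorted word list.
function sketch_binary_search(string $word, array $dictionary): array
{
    $bot   = 0;
    $top   = count($dictionary) - 1;
    $steps = 0;
    $word  = strtolower($word);

    while ($top >= $bot) {
        $steps++;
        $p   = intdiv($top + $bot, 2);          // midpoint of the current range
        $cmp = strcmp($dictionary[$p], $word);  // byte-wise string comparison

        if ($cmp < 0) {
            $bot = $p + 1;                      // word sorts after the pivot
        } elseif ($cmp > 0) {
            $top = $p - 1;                      // word sorts before the pivot
        } else {
            return [true, $steps];
        }
    }

    return [false, $steps];
}

// Tiny made-up word list; it must be lowercase and sorted.
$dictionary = ["age", "best", "it", "of", "the", "times", "was", "worst"];
sort($dictionary, SORT_STRING);

$hit  = sketch_binary_search("Times", $dictionary);
$miss = sketch_binary_search("teh", $dictionary);

// Worst-case number of iterations for a list of 80,000 entries.
$maxSteps = (int) floor(log(80000, 2)) + 1;

echo "Found 'Times': ", var_export($hit[0], true), " in {$hit[1]} steps\n";
echo "Found 'teh': ", var_export($miss[0], true), " in {$miss[1]} steps\n";
echo "Upper bound for 80,000 words: $maxSteps steps\n";
```

Loading the dictionary once and reusing the array for every lookup, as recommended above, keeps each check down to this handful of comparisons.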
The Plug-in

   function PIPHP_SpellCheck($text, $action)
   {
      $dictionary = PIPHP_SpellCheckLoadDictionary("dictionary.txt");
      $text      .= ' ';   // Guarantee a non-word character after the last word
      $newtext    = "";
      $offset     = 0;

      while ($offset < strlen($text))
      {
         $result = preg_match('/[^\w]*([\w]+)[^\w]+/',
            $text, $matches, PREG_OFFSET_CAPTURE, $offset);

         // No further word found (for example, only punctuation
         // remains), so stop rather than loop forever
         if (!$result) break;

         $word   = $matches[1][0];
         $offset = $matches[0][1] + strlen($matches[0][0]);

         // Wrap unrecognized words in the highlighting tag
         if (!PIPHP_SpellCheckWord($word, $dictionary))
            $newtext .= "<$action>$word</$action> ";
         else $newtext .= "$word ";
      }

      return rtrim($newtext);
   }

   function PIPHP_SpellCheckLoadDictionary($filename)
   {
      // One word per line, separated by \r\n pairs
      return explode("\r\n", file_get_contents($filename));
   }

   function PIPHP_SpellCheckWord($word, $dictionary)
   {
      $top  = sizeof($dictionary) - 1;
      $bot  = 0;
      $word = strtolower($word);

      // Binary search of the sorted word list
      while ($top >= $bot)
      {
         $p = (int) floor(($top + $bot) / 2);

         if     ($dictionary[$p] < $word) $bot = $p + 1;
         elseif ($dictionary[$p] > $word) $top = $p - 1;
         else   return TRUE;
      }

      return FALSE;
   }

Remove Accents

When you have data that is accented with diacritics such as the letter "é", you sometimes need to convert this data to plain ASCII but still be able to read it. The solution is to replace all the diacritic characters with standard ones using this plug-in. Figure 3-9 shows some French text before and after running the plug-in.

FIGURE 3-9 Part of the French Wikipedia entry for PHP before and after running it through this plug-in