
Mastering Algorithms with Perl, part 6


    foreach $c ( @S ) {
        $sum   += $c * $power;
        $power *= $Sigma;
    }

* Or rather, the code shows the iterative formulation of it: the more mathematically minded may prefer c[n] x^n + c[n-1] x^(n-1) + ... + c[2] x^2 + c[1] x + c[0] = ((...(c[n] x + c[n-1]) x + ...) x + c[1]) x + c[0].

But this is silly: for the n elements of @S (n being scalar @S, the size of @S), this performs n additions and 2n multiplications. Instead we can get away with only n multiplications (and $power is not needed at all):

    $sum = 0;

    foreach $c ( @S ) {
        $sum *= $Sigma;
        $sum += $c;
    }

This trick is Horner's rule: within the loop, perform one multiplication (instead of two) first, and then one addition. We can further eliminate one of the multiplications, the useless multiplication by zero:

    $sum = $S[0];

    foreach $c ( @S[ 1 .. $#S ] ) {
        $sum *= $Sigma;
        $sum += $c;
    }

So from 2n + 2 assignments (counting *= and += as assignments), n additions, and 2n multiplications, we have reduced the burden to 2n - 1 assignments, n - 1 additions, and n - 1 multiplications.

Having processed the pattern, we advance through the text one character at a time, processing each slice of m characters in the text just like the pattern. When we get identical numbers, we are bound to have a match, because there is only one possible combination of multipliers that can produce the desired number. Thus, the multipliers (characters) in the text are identical to the multipliers in the pattern.

Handling Huge Checksums

The large checksums cause trouble with Perl because it cannot reliably handle such large integers. Perl guarantees reliable storage only for 32-bit integers, covering numbers up to 2^32 - 1. That translates into 4 (8-bit) characters. Past that, Perl silently starts using floating-point numbers, which cannot guarantee exact storage: large floating-point numbers start to lose their less significant digits, making tests for numeric equality useless.

Rabin and Karp proposed using modular arithmetic to handle these large numbers. The checksums are computed modulo q, where q is a prime such that (|Σ| + 1) q is still below the maximum integer the system can handle. More specifically, we want to find the largest prime q that satisfies (256 + 1) q < 2,147,483,647. The reason for using 2,147,483,647 (2^31 - 1) instead of 4,294,967,295 (2^32 - 1) will be explained shortly. The prime we are looking for is 8,355,967. (For more information about finding primes, see the section "Prime Numbers" in Chapter 12, Number Theory.)

If, after each multiplication and sum, we calculate the result modulo 8,355,967, we are guaranteed never to surpass 2,147,483,647. Let's try this, taking the modulo whenever the number is about to "escape":

    "ABCDE" == 65 * (256**4 % 8355967) + 66 * (256**3 % 8355967)
               + 67 * (256**2 % 8355967) + 68 * 256 + 69
            == 65 * 16712192 + 66 * 65282 + 67 * 65536 + 68 * 256 + 69
            == 377804

We may check the final result (using, for example, Math::BigInt) and see that 280,284,578,885 modulo 8,355,967 does indeed equal 377,804.

The good news is that the number now stays manageable. The bad news is that our problem just moved; it didn't go away. Using the modulus means that we can no longer be absolutely certain of our match: a = b mod c does not mean that a = b. For example, 23 = 2 mod 7, but very clearly 23 does not equal 2. In matching terms, this means that we might encounter false hits.
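The following sketch is not from the book; it double-checks the worked example with Math::BigInt (exact arithmetic) and with Horner's rule reduced modulo q at every step. Both ways should print 377804.

    # Not from the book: verify the "ABCDE" example above.
    use Math::BigInt;

    my $q     = 8_355_967;     # the prime modulus chosen in the text
    my $Sigma = 256;           # assume 8-bit characters

    # Exact Horner's rule, with arbitrary-precision integers.
    my $exact = Math::BigInt->new(0);
    $exact->bmul( $Sigma )->badd( ord( $_ ) ) for split //, "ABCDE";
    print "exact:   $exact\n";                            # 280284578885
    print "mod q:   ", $exact->copy->bmod( $q ), "\n";    # 377804

    # Horner's rule again, reducing modulo q after every step.
    my $sum = 0;
    $sum = ( $sum * $Sigma + ord( $_ ) ) % $q for split //, "ABCDE";
    print "modular: $sum\n";                              # 377804 again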
The estimated number of false hits is O(n/q), so with our q = 8,355,967, and assuming the pattern to be at most 15 characters long, we should expect fewer than one match in a million to be false. As an example, we match the pattern dabba against the text abadabbacab (see Figure 9-1). First the Rabin-Karp sum of the pattern is computed; then T is sliced m characters at a time and the Rabin-Karp sum of each slice is computed.

Implementing Rabin-Karp

Our implementation of Rabin-Karp can be called in two ways, for computing either a total sum or an incremental sum. A total sum is computed when the sum is returned at once for a whole string: this is how the sum is computed for the pattern and for the first $m characters of the text. The incremental method uses an additional trick: before bringing in the next character using Horner's rule, it removes the contribution of the highest "digit" from the previous round by subtracting the product of the previously highest digit and the highest multiplier, $hipow. In other words, we strip the oldest character off the back and load a new character on the front. This trick saves us from having to compute the checksum of $m characters all over again. Both the total and the incremental ways use Horner's rule.

Figure 9-1. Rabin-Karp matching

    my $NICE_Q = 8355967;

    # rabin_karp_sum_modulo_q( $S, $q, $n )
    #
    # $S is the string to be summed.
    # $q is the modulo base (default $NICE_Q).
    # $n is the (prefix) length of the string to be summed
    #    (default length($S)).

    sub rabin_karp_sum_modulo_q {
        my ( $S ) = shift;              # The string.

        use integer;                    # We use only integers.

        my $q = @_ ? shift : $NICE_Q;
        my $n = @_ ? shift : length( $S );

        my $Sigma = 256;                # Assume 8-bit text.

        my ( $i, $sum, $hipow );

        if ( @_ ) {                     # Incremental summing.
            ( $i, $sum, $hipow ) = @_;

            if ( $i > 0 ) {
                my $hiterm;             # The contribution of the highest digit.

                $hiterm  = $hipow * ord( substr( $S, $i - 1, 1 ) );
                $hiterm %= $q;

                $sum -= $hiterm;
                $sum += $q if $sum < 0; # Keep the sum non-negative: under
                                        # "use integer", % may otherwise
                                        # return a negative remainder.
            }

            $sum *= $Sigma;
            $sum += ord( substr( $S, $n + $i - 1, 1 ) );
            $sum %= $q;

            return $sum;                # The sum.
        } else {                        # Total summing.
            ( $sum, $hipow ) = ( ord( substr( $S, 0, 1 ) ), 1 );

            for ( $i = 1; $i < $n; $i++ ) {
                $sum *= $Sigma;
                $sum += ord( substr( $S, $i, 1 ) );
                $sum %= $q;

                $hipow *= $Sigma;
                $hipow %= $q;
            }

            # Note that in list context we also return the highest used
            # multiplier of the digits, mod $q, as $hipow:
            # e.g., 256**4 mod $q == 258 for $n == 5.
            return wantarray ? ( $sum, $hipow ) : $sum;
        }
    }

Now let's use the algorithm to find a match:

    sub rabin_karp_modulo_q {
        my ( $T, $P, $q ) = @_;   # The text, the pattern, and an optional modulus.

        use integer;

        my $n = length( $T );
        my $m = length( $P );

        return -1 if $m > $n;
        return  0 if $m == $n and $P eq $T;

        $q = $NICE_Q unless defined $q;

        my ( $KRsum_P, $hipow ) = rabin_karp_sum_modulo_q( $P, $q, $m );
        my ( $KRsum_T )         = rabin_karp_sum_modulo_q( $T, $q, $m );

        return 0 if $KRsum_T == $KRsum_P and substr( $T, 0, $m ) eq $P;

        my $i;
        my $last_i = $n - $m;     # $i will go from 1 to $last_i.

        for ( $i = 1; $i <= $last_i; $i++ ) {
            $KRsum_T =
                rabin_karp_sum_modulo_q( $T, $q, $m, $i, $KRsum_T, $hipow );
            return $i
                if $KRsum_T == $KRsum_P and substr( $T, $i, $m ) eq $P;
        }

        return -1;                # Mismatch.
    }
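Here is a brief usage sketch, not from the book, that exercises the matcher on the pattern and text of Figure 9-1 (it assumes the two subroutines and $NICE_Q above live in the same file):

    # Not from the book: match "dabba" against "abadabbacab" (Figure 9-1).
    my $pos = rabin_karp_modulo_q( "abadabbacab", "dabba" );
    print $pos >= 0 ? "match at position $pos\n" : "no match\n";
    # Expected output: match at position 3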
If asked for a total sum, rabin_karp_sum_modulo_q($S, $q, $n) computes the sum of the first $n characters of $S, modulo $q. If $n is not given, the sum is computed over all the characters of the first argument. If $q is not given, 8,355,967 is used.

The subroutine returns the (modular) sum or, in list context, both the sum and the highest used power (reduced by the same modulus). For example, with n = 5 the highest used power is 256**(5-1) mod 8,355,967 = 258, assuming that |Σ| = 256.

If called for an incremental sum, rabin_karp_sum_modulo_q($S, $q, $n, $i, $sum, $hipow) computes, modulo $q, the sum of the $n characters of $S starting at position $i. The $sum is used both for input and output: on input it is the sum so far. The $hipow must be the highest used power returned by the initial total summing call.

Further Checksum Experimentation

As a checksum algorithm, Rabin-Karp can be improved. We will experiment a little more in the following two ways.

The first idea: one can trivially turn the modular Rabin-Karp into a binary mask Rabin-Karp. Instead of using a prime modulus, use an integer of the form 2^k - 1, for example 2^31 - 1 = 2,147,483,647, and replace all the modular operations with a binary mask: & 2147483647. This way only the 31 lowest bits matter and any overflow is obliterated by the merciless mask. However, benchmarking the mask version against the modular version shows no dramatic differences: a few percentage points depending on the underlying operating system and CPU.

Then to our second variation. The original Rabin-Karp algorithm without the modulus is by its definition more than a strong checksum: it is a one-to-one mapping between a string (either the pattern or a substring of the text) and a number.* The introduction of the modulus or the mask weakens it down to a checksum of strength $q or $mask; that is, every $qth or $maskth potential match will be a false one. Now we see how much we gave up by using 2,147,483,647 instead of 4,294,967,295: instead of having a false hit every 4 billionth character, we will experience failure every 2 billionth character. Not a bad deal.

* A checksum is strong if there are few (preferably zero) checksum collisions, that is, inputs reducing to identical checksums.

For the checksum, we can use the built-in checksum feature of the unpack() function. The whole Rabin-Karp summing subroutine can be replaced with one unpack("%32C*") call. The %32 part indicates that we want a 32-bit checksum (%), and the C* part says that we want the checksum over all (*) the characters (C). This time we do not have separate total and incremental versions, just a total sum.

    sub rabin_karp_unpack_C {
        my ( $T, $P ) = @_;    # The text and the pattern.

        use integer;

        my ( $KRsum_P, $m ) = ( unpack( "%32C*", $P ), length( $P ) );

        my ( $i );
        my ( $last_i ) = length( $T ) - $m;

        for ( $i = 0; $i <= $last_i; $i++ ) {
            return $i
                if unpack( "%32C*", substr( $T, $i, $m ) ) == $KRsum_P
                   and substr( $T, $i, $m ) eq $P;
        }

        return -1;             # Mismatch.
    }

This is fast, because Perl's checksumming is very fast. Yet another checksum method is the MD5 module, written by Gisle Aas and available from CPAN. MD5 is a cryptographically strong checksum: see Chapter 13 for more information.

The 32-bit checksumming version of Rabin-Karp can be adapted to comparing sequences. We can concatenate the array elements with a zero byte ("\0") using join(). This doesn't guarantee uniqueness, because the data might contain zero bytes, so we need an inner loop that checks each of the elements for matches. If, on the other hand, we know that there are no zero bytes in the input, we know immediately after a successful unpack() match that we have a true match. Any separator guaranteed not to appear in the input can fill the role of the "\0".
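As a small aside that is not in the book, the following shows why the explicit string comparison in rabin_karp_unpack_C() matters: the %32C* checksum is simply the sum of the character codes, so any permutation of the same characters produces a colliding checksum, and only the eq test separates false hits from real matches.

    # Not from the book: two different strings with the same %32C* checksum.
    printf "%d %d\n",
        unpack( "%32C*", "dabba" ),
        unpack( "%32C*", "abbad" );               # both print the same number

    # The eq test inside rabin_karp_unpack_C() still finds the real match.
    my $i = rabin_karp_unpack_C( "abadabbacab", "dabba" );
    print "match at position $i\n" if $i >= 0;    # match at position 3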
Rabin-Karp would seem to be better than the naïve matcher because it processes several characters in one stride, but its worst-case performance is actually just as bad as that of the naïve matcher: Θ((n - m + 1) m). In practice, however, false hits are rare (as long as the checksum is a good one), and the expected performance is O(n + m).

If you are familiar with how data is stored in computers, you might wonder why you need to go to the trouble of checksumming with Rabin-Karp. Why not just compare the strings as 32-bit integers? Yes, deep down that is very efficient, and the standard libraries of many operating systems have well-tuned assembly language subroutines that do exactly that. However, the string is unlikely to sit neatly at the 32-bit boundaries, or 64-bit boundaries, or any other nice and clean boundaries we would like it to sit at. On average, three out of four patterns will straddle the 32-bit limits, so the brute-force method of matching 32-bit machine words instead of characters won't work.

Knuth-Morris-Pratt

The obvious inefficiency of both the naïve matcher and Rabin-Karp is that they back up a lot: on a false match the process starts again with the next character immediately after the current one. This may be a big waste, because after a false hit it may be possible to skip more characters. The algorithm that does this is Knuth-Morris-Pratt, and the skip function is called the prefix function. Although it is called a function, it is just a static integer array of length m + 1. Figure 9-2 illustrates KMP matching.

Figure 9-2. Knuth-Morris-Pratt matching

The pattern character a fails to match the text character b. We may in fact slide the pattern forward by 3 positions, which is the next possible alignment of the first character (a). (See Figure 9-3.) The Knuth-Morris-Pratt prefix function encodes these maximum slides.

Figure 9-3. Knuth-Morris-Pratt matching: large skip

We will implement the Knuth-Morris-Pratt prefix function as a Perl array, @next. We define $next[$j] to be the maximum integer $k, less than $j, such that the prefix of length $k - 1 is also a proper suffix of the first $j - 1 characters of the pattern. This function can be found by sliding the pattern over itself, as shown in Figure 9-4.

Figure 9-4. KMP prefix function for "acabad"

In Figure 9-3, if we fail at pattern position $j = 1, we may skip forward only by 1 - 0 = 1 character, because the next character may be an a for all we know. On the other hand, if we fail at pattern position $j = 2, we may skip forward by 2 - (-1) = 3 positions, because for this position to have an a starting the pattern anew there couldn't have been a mismatch.

With the example text "babacbadbbac", we get the process in Figure 9-5. The upper diagram shows the point of mismatch, and the lower diagram shows the comparison point just after the forward skip by 3. We skip straight over the c and the b and hope that this new a is the very first character of a match.

Figure 9-5. KMP prefix function in action

The code for Knuth-Morris-Pratt consists of two functions: the computation of the prefix function and the matcher itself. The following example illustrates the computation of the prefix function:

    sub knuth_morris_pratt_next {
        my ( $P ) = @_;            # The pattern.

        use integer;

        my ( $m, $i, $j ) = ( length $P, 0, -1 );
        my @next;

        for ( $next[0] = -1; $i < $m; ) {
            # Note that this while() is skipped during the first for() pass.
            while ( $j > -1 &&
                    substr( $P, $i, 1 ) ne substr( $P, $j, 1 ) ) {
                $j = $next[ $j ];
            }
            $i++;
            $j++;
            $next[ $i ] =
                substr( $P, $j, 1 ) eq substr( $P, $i, 1 ) ?
                    $next[ $j ] : $j;
        }

        return ( $m, @next );      # Length of pattern and prefix function.
    }
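A small sketch, not from the book, that dumps the prefix function for the pattern "acabad" of Figure 9-4; tracing the subroutine by hand suggests that $next[1] should be 0 and $next[2] should be -1, which correspond to the skips of 1 and 3 discussed above.

    # Not from the book: print the prefix function of the pattern "acabad".
    my ( $m, @next ) = knuth_morris_pratt_next( "acabad" );
    printf "next[%d] = %2d\n", $_, $next[ $_ ] for 0 .. $m;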
The matcher looks disturbingly similar to the prefix function computation. This is not accidental: both the prefix function and Knuth-Morris-Pratt itself are finite automata, algorithmic creatures that can be used to build complex recognizers known as parsers. We will explore finite automata in more detail later in this chapter. The following example illustrates the matcher:

    sub knuth_morris_pratt {
        my ( $T, $P ) = @_;        # Text and pattern.

        use integer;

        my ( $m, @next ) = knuth_morris_pratt_next( $P );
        my ( $n, $i, $j ) = ( length( $T ), 0, 0 );

        while ( $i < $n ) {
            while ( $j > -1 &&
                    substr( $P, $j, 1 ) ne substr( $T, $i, 1 ) ) {
                $j = $next[ $j ];
            }
            $i++;
            $j++;
            return $i - $j if $j >= $m;    # Match.
        }

        return -1;                         # Mismatch.
    }

The time complexity of Knuth-Morris-Pratt is O(m + n). This follows very simply from the obvious O(m) complexity of computing the prefix function and the O(n) complexity of the matching process itself.

Boyer-Moore

The Boyer-Moore algorithm tries to skip forward in the text even faster. It does this by using not one but two heuristics for how far to skip; the larger of the proposed skips wins. Boyer-Moore is the most appropriate algorithm when the pattern is long and the alphabet Σ is large, say, when m > 5 and |Σ| is several dozen. In practice, this means that when matching normal text, you should use Boyer-Moore. And Perl does exactly that.

The basic structure of Boyer-Moore resembles the naïve matcher. There are two main differences. First, the matching is done backwards, from the end of the pattern towards the beginning. Second, after a failed attempt, Boyer-Moore advances by leaps and bounds instead of by just one position. At top speed only every mth character in the text needs to be examined.

Boyer-Moore uses two heuristics to decide how far to leap: the bad-character heuristic, also called the (last) occurrence heuristic, and the good-suffix heuristic, also called the match heuristic. Information for each heuristic is maintained in an array built at the beginning of the matching operation.

The bad-character heuristic indicates how far you can safely jump forward in the text after a mismatch. The heuristic is an array in which each position represents a character in Σ and each value is the minimal distance from that character to the end of the pattern (when a character appears more than once in the pattern, only the last occurrence matters). In our pattern, for instance, the last a is followed by one more character, so the position assigned to a in the array contains the value 1:

    pattern position     0  1  2  3  4
    pattern character    d  a  b  a  b

    character                  a  b  c  d
    bad-character heuristic    1  0  5  4

The earlier a character occurs in the pattern, the farther a mismatch caused by that character allows us to skip. Mismatching characters that do not occur in the pattern at all allow us to skip with maximal speed. The heuristic requires space proportional to |Σ|; we made our example fit the page by assuming a |Σ| of just 4 characters.
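Here is a sketch, not from the book, of how the bad-character table above can be built, using a hash keyed by character rather than an array over the whole alphabet:

    # Not from the book: build the bad-character table for a pattern.
    sub boyer_moore_bad_character {
        my ( $P ) = @_;           # The pattern.
        my $m = length $P;
        my %bc;

        # Walk the pattern left to right so that the rightmost occurrence
        # of each character is the one that survives.
        for my $i ( 0 .. $m - 1 ) {
            $bc{ substr( $P, $i, 1 ) } = $m - 1 - $i;
        }

        return \%bc;
    }

    # For the pattern "dabab" this yields a => 1, b => 0, d => 4.  Characters
    # that do not occur in the pattern are absent from the hash and should be
    # treated as $m (5 here), the value shown for c in the table above.
    my $bc = boyer_moore_bad_character( "dabab" );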
The good-suffix heuristic is another way to tell how many characters we can safely skip if there isn't a match; it is based on the backward matching order of Boyer-Moore (see the example shortly). The heuristic is stored in an array in which each position represents a position in the pattern. It can be found by [...]

[...]

Phonetic Algorithms

This section discusses phonetic algorithms, a family of string algorithms that, like approximate/fuzzy string searching, make life a bit easier when you're trying to locate something that might be misspelled. The algorithms transform one string into another. The new string can then [...]

[...] consonants in the word:

    Number   Consonants
    1        B P F V
    2        C S G J K Q X Z
    3        D T
    4        L
    5        M N
    6        R

The letters A, E, I, O, U, Y, H, and W are not coded (yes, all vowels are considered irrelevant). Here are more examples of soundex transformation:

    Heilbronn   HLBR   H416
    Hilbert     HLBR   H416
    Perl        PRL    P64
    pearl       PRL    P64
    peril       PRL    P64
    prowl       PRL    P64
    puerile     PRL    P64

Text::Metaphone

The Text::Metaphone module, implemented by Michael [...] the code was reimplemented in C (via the XS mechanism) instead of Perl to gain extra speed.

[...] matching algorithms that look weird at first because they do not match strings as such: they match bit patterns. Instead of asking, "does this character match this character?", they twiddle bits around with binary arithmetic. They do this by reducing both the pattern and the text down to bit patterns. The crux of these algorithms is the iterative step [...] These algorithms are collectively called shift-op algorithms [...] the operations are OR and + [...] The state is initialized from the pattern P [...]
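Referring back to the Soundex digit table in the fragments above, here is a minimal sketch, not from the book, of that table as Perl data; a complete soundex implementation (such as the CPAN module Text::Soundex) additionally keeps the first letter of the word, drops the uncoded letters, and collapses adjacent identical digits.

    # Not from the book: the soundex digit table as a Perl hash.
    my %soundex_digit;
    my %group = ( 1 => 'BPFV', 2 => 'CSGJKQXZ', 3 => 'DT',
                  4 => 'L',    5 => 'MN',       6 => 'R' );
    while ( my ( $digit, $letters ) = each %group ) {
        $soundex_digit{ $_ } = $digit for split //, $letters;
    }
    print "R encodes as $soundex_digit{R}\n";    # prints: R encodes as 6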

