STRING SEARCHING
253
This leads to the very simple pattern-matching algorithm implemented
below. The program assumes the same index function as above, but d=32 is
used for efficiency (the multiplications might be implemented as shifts).
    function rksearch: integer;
    const q=33554393; d=32;
    var h1, h2, dM, i: integer;
    begin
    { dM = d^(M-1) mod q, the weight of the leading character }
    dM:=1; for i:=1 to M-1 do dM:=(d*dM) mod q;
    { h1 = hash of the pattern, h2 = hash of the first M text characters }
    h1:=0; for i:=1 to M do h1:=(h1*d+index(p[i])) mod q;
    h2:=0; for i:=1 to M do h2:=(h2*d+index(a[i])) mod q;
    i:=1;
    while (h1<>h2) and (i<=N-M) do
      begin
      { drop a[i] from the hash, then shift in a[i+M] }
      h2:=(h2+d*q-index(a[i])*dM) mod q;
      h2:=(h2*d+index(a[i+M])) mod q;
      i:=i+1;
      end;
    rksearch:=i;
    end;
The program first computes a hash value h1 for the pattern, then a hash
value h2 for the first M characters of the text. (It also computes the value
of d^(M-1) mod q in the variable dM.) Then it proceeds through the text string,
using the technique above to compute the hash function for the M characters
starting at position i for each i, comparing each new hash value to h1. The
prime q is chosen to be as large as possible, but small enough that (d+1)*q
doesn’t cause overflow: this requires fewer mod operations than if we used the
largest representable prime. (An extra d*q is added during the h2 calculation
to make sure that everything stays positive so that the mod operation works
as it should.)
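The update step can be checked numerically. The following is a minimal sketch in Python (rather than the Pascal used above), under the assumption that the index function maps each character into the range 0..d-1; the names index, full_hash and roll are illustrative and do not come from the text. It confirms that removing the leading character and shifting in the next one gives the same value as rehashing the window from scratch.

```python
Q = 33554393  # the prime modulus q from the program above
D = 32        # the radix d from the program above

def index(c):
    # Assumed stand-in for the text's index function: any map into 0..D-1 will do.
    # With ord(c) % 32, a blank maps to 0 and 'A'..'Z' map to 1..26.
    return ord(c) % D

def full_hash(s):
    # Horner's rule: treat s as a base-D number, reducing mod Q at each step.
    h = 0
    for c in s:
        h = (h * D + index(c)) % Q
    return h

def roll(h, old, new, dM):
    # Remove the leading character's contribution, then shift in the new one.
    # D*Q is added before subtracting, mirroring the program above.
    h = (h + D * Q - index(old) * dM) % Q
    return (h * D + index(new)) % Q

text = "A STRING SEARCHING EXAMPLE"
M = 5
dM = pow(D, M - 1, Q)  # d^(M-1) mod q

h = full_hash(text[:M])
for i in range(len(text) - M):
    h = roll(h, text[i], text[i + M], dM)
    # The rolled value must equal the hash of the window starting at i+1.
    assert h == full_hash(text[i + 1:i + 1 + M])

# h + D*Q stays below (D + 1) * Q, which fits in a signed 32-bit integer.
assert (D + 1) * Q < 2**31
```

Python integers do not overflow, so the final assertion only illustrates the bound that matters for a fixed-width integer type such as the one the Pascal program relies on.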
This algorithm obviously takes time proportional to N + M. Note that
it really only finds a position in the text which has the same hash value as the
pattern, so, to be sure, we really should do a direct comparison of that text
with the pattern. However, the use of such a large value of q, made possible by
the mod computations and by the fact that we don’t have to keep the actual
hash table around, makes it extremely unlikely that a collision will occur.
Theoretically, this algorithm could still take NM steps in the (unbelievably)
worst case, but in practice the algorithm can be relied upon to take about
N + M steps.
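The direct comparison mentioned above is easy to add. The sketch below, again in Python rather than the text's Pascal, is one possible variant and not the program given earlier: it uses 0-based positions, returns -1 when there is no match, and assumes an index function that maps characters into 0..d-1 (here ord(c) mod 32, an illustrative choice). The name rk_search_verified is hypothetical.

```python
Q = 33554393  # prime modulus q from the text
D = 32        # radix d from the text

def index(c):
    # Assumed stand-in for the text's index function; any map into 0..D-1 works.
    return ord(c) % D

def rk_search_verified(p, a):
    """Return the 0-based position of the first true match of p in a, or -1."""
    M, N = len(p), len(a)
    if M == 0:
        return 0
    if M > N:
        return -1
    dM = pow(D, M - 1, Q)  # d^(M-1) mod q
    h1 = h2 = 0
    for i in range(M):
        h1 = (h1 * D + index(p[i])) % Q
        h2 = (h2 * D + index(a[i])) % Q
    for i in range(N - M + 1):
        # Equal hash values only suggest a match; confirm it character by character.
        if h1 == h2 and a[i:i + M] == p:
            return i
        if i < N - M:
            h2 = (h2 + D * Q - index(a[i]) * dM) % Q
            h2 = (h2 * D + index(a[i + M])) % Q
    return -1
```

With this particular index function, upper- and lower-case letters hash alike, so searching for "search" in "A STRING SEARCHING EXAMPLE" produces a hash collision at the position of "SEARCH" that the direct comparison then rejects.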
254 CHAPTER 19
Multiple Searches
The algorithms that we’ve been discussing are all oriented towards a specific
string searching problem: find an occurrence of a given pattern in a given
text string. If the same text string is to be the object of many pattern
searches, then it will be worthwhile to do some processing on the string to
make subsequent searches efficient.
If there are a large number of searches, the string searching problem can
be viewed as a special case of the general searching problem that we studied
in the previous section. We simply treat the text string as N overlapping
“keys,” the ith key defined to be a[i..N], the entire text string starting at
position i. Of course, we don’t manipulate the keys themselves, but pointers
to them: when we need to compare keys i and j we do character-by-character
compares starting at positions i and j of the text string. (If we use a “sentinel”
character larger than all other characters at the end, then one of the keys
will always be greater than the other.) Then the hashing, binary tree, and
other algorithms of the previous section can be used directly. First, an entire
structure is built up from the text string, and then efficient searches can be
performed for particular patterns.
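As one illustration of this idea, the sketch below (in Python rather than the text's Pascal) keeps the starting positions in sorted order of the keys they point to and finds a pattern by binary search. It is a simplified stand-in for the searching structures of the previous section, not any one of them; the names build_suffix_index and find are hypothetical. Python's string comparison already treats a shorter string as smaller than any of its extensions, so no sentinel character is needed here.

```python
def build_suffix_index(a):
    # Positions 0..N-1 stand for the keys a[i:], ordered by the key each denotes.
    # For brevity sorted() materialises every key, which costs quadratic space;
    # a careful version would compare characters in place, as the text describes.
    return sorted(range(len(a)), key=lambda i: a[i:])

def find(a, suffixes, p):
    """Return a 0-based position where p occurs in a, or -1 if there is none."""
    lo, hi = 0, len(suffixes)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffixes[mid]
        # Compare only the first len(p) characters of the key with the pattern.
        if a[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffixes) and a.startswith(p, suffixes[lo]):
        return suffixes[lo]
    return -1
```

Building the index is the expensive step; each subsequent search then takes a number of key comparisons logarithmic in N.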
There are many details which need to be worked out in applying searching
algorithms to string searching in this way; our intent is to point this out as
a viable option for some string searching applications. Different methods will
be appropriate in different situations. For example, if the searches will always
be for patterns of the same length, a hash table constructed with a single scan
as in the Rabin-Karp method will yield constant search times on the average.
On the other hand, if the patterns are to be of varying length, then one of the
tree-based methods might be appropriate. (Patricia is especially adaptable to
such an application.)
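For the fixed-length case, a minimal sketch in Python (rather than the text's Pascal) might look as follows. It builds a table from hash values to text positions in one scan using the Rabin-Karp update, then answers each query with one hash computation plus a direct comparison. The index function (ord(c) mod 32) and the names build_table and lookup are assumptions made for illustration.

```python
from collections import defaultdict

Q = 33554393  # prime modulus q from the Rabin-Karp program
D = 32        # radix d from the Rabin-Karp program

def index(c):
    # Assumed stand-in for the text's index function; any map into 0..D-1 works.
    return ord(c) % D

def build_table(a, M):
    """Map each hash value to the positions whose M-character window has it."""
    table = defaultdict(list)
    if M == 0 or M > len(a):
        return table
    dM = pow(D, M - 1, Q)  # d^(M-1) mod q
    h = 0
    for c in a[:M]:
        h = (h * D + index(c)) % Q
    table[h].append(0)
    for i in range(len(a) - M):
        # Slide the window one position to the right.
        h = (h + D * Q - index(a[i]) * dM) % Q
        h = (h * D + index(a[i + M])) % Q
        table[h].append(i + 1)
    return table

def lookup(a, table, p):
    """Return all 0-based positions where p occurs; len(p) must equal the table's M."""
    h = 0
    for c in p:
        h = (h * D + index(c)) % Q
    # Equal hashes are confirmed by a direct comparison to rule out collisions.
    return [i for i in table.get(h, []) if a.startswith(p, i)]
```

For example, with a = "A STRING SEARCHING EXAMPLE" and M = 3, looking up "ING" returns the positions 5 and 15.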
Other variations in the problem can make it significantly more difficult
and lead to drastically different methods, as we’ll discover in the next two
chapters.
Exercises
1. Implement a brute-force pattern matching algorithm that scans the pattern
   from right to left.
2. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern
   AAAAAAAAA.
3. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern
   ABRACADABRA.
4. Draw a finite state machine which can search for the pattern
   ABRACADABRA.
5. How would you search a text file for a string of 50 consecutive blanks?
6. Give the right-to-left skip table for the right-left scan for the pattern
   ABRACADABRA.
7. Construct an example for which the right-to-left pattern scan with only
   the mismatch heuristic performs badly.
8. How would you modify the Rabin-Karp algorithm to search for a given
   pattern with the additional proviso that the middle character is a “wild
   card” (any text character at all can match it)?
9. Implement a version of the Rabin-Karp algorithm that can find a given
   two-dimensional pattern in a given two-dimensional text. Assume both
   pattern and text are rectangles of characters.
10. Write programs to generate a random 1000-bit text string, then find all
    occurrences of the last k bits elsewhere in the string, for k = 5, 10, 15.
    (Different methods might be appropriate for different values of k.)