Advanced Algorithm Analysis & Design


Pattern Matching

[Figure: the pattern a b a c a b compared against the text a b a c a a b; comparisons numbered 1 to 4]

Text processing


Outline and Reading

Strings (§9.1.1)

Pattern matching algorithms

Strings

A string is a sequence of characters.

Examples of strings:
  - Java program
  - HTML document
  - DNA sequence
  - Digitized image

An alphabet Σ is the set of possible characters for a family of strings.

Examples of alphabets:
  - ASCII
  - Unicode
  - {0, 1}
  - {A, C, G, T}

Let P be a string of size m:
  - A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j
  - A prefix of P is a substring of the type P[0..i]
  - A suffix of P is a substring of the type P[i..m − 1]

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.

Applications:
  - Text editors
  - Search engines
  - Biological research
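The substring, prefix, and suffix definitions can be tried out with Python slices. This is a minimal sketch: the helper names `substring`, `prefix`, and `suffix` are illustrative and not from the slides, and it assumes that "ranks between i and j" includes both ends, so P[i..j] corresponds to the half-open Python slice `P[i:j + 1]`.

```python
def substring(P: str, i: int, j: int) -> str:
    """Return P[i..j]: the characters with ranks i through j, both included (assumed)."""
    return P[i:j + 1]


def prefix(P: str, i: int) -> str:
    """Return the prefix P[0..i]."""
    return substring(P, 0, i)


def suffix(P: str, i: int) -> str:
    """Return the suffix P[i..m - 1], where m is the size of P."""
    return substring(P, i, len(P) - 1)


if __name__ == "__main__":
    P = "abacab"
    print(substring(P, 1, 3))  # bac
    print(prefix(P, 2))        # aba
    print(suffix(P, 4))        # ab
```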


Brute-Force Algorithm

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried

Brute-force pattern matching runs in time O(nm).

Example of worst case:
  - T = aaa … ah
  - P = aaah
  - may occur in images and DNA sequences, but is unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input text T of size n and pattern P of size m
  Output starting index of a substring of T equal to P or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i { match at i }
  return −1 { no match anywhere }
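The pseudocode translates almost line for line into Python. The sketch below is one possible rendering; the function name `brute_force_match` is an assumption, and the final `return -1` covers the case where every shift has been tried without success.

```python
def brute_force_match(T: str, P: str) -> int:
    """Return the first index i such that T[i..i+m-1] equals P, or -1 if there is none."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):          # test shift i of the pattern
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:                      # all m characters matched
            return i
    return -1                           # no placement of the pattern matched


if __name__ == "__main__":
    # Small instance of the worst-case shape from the slide: T = aaa...ah, P = aaah
    print(brute_force_match("aaaaah", "aaah"))  # 2
```

On inputs of the worst-case shape, almost every shift compares m characters before failing, which is where the O(nm) bound comes from.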


Boyer-Moore Heuristics

Boyer-Moore’s pattern matching algorithm is based on two heuristics:

Looking-glass heuristic: Compare P with a subsequence of T moving backwards

Character-jump heuristic: When a mismatch occurs at T[i] = c
  - If P contains c, shift P to align the last occurrence of c in P with T[i]
  - Else, shift P to align P[0] with T[i + 1]

Example
[Figure: the pattern "rithm" matched against the text "a pattern matching algorithm"; comparisons numbered 1 to 11]


Last-Occurrence Function

Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as
  - the largest index i such that P[i] = c, or
  - −1 if no such index exists

Example:
  - Σ = {a, b, c, d}
  - P = abacab

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters.

The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
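A short sketch of the last-occurrence function for the example above. It uses a dictionary rather than an array indexed by character codes, which is an implementation choice made here for brevity; the name `last_occurrence` is also an assumption.

```python
def last_occurrence(P: str, alphabet: str) -> dict:
    """Map each character c of the alphabet to the largest i with P[i] == c, or -1."""
    L = {c: -1 for c in alphabet}   # O(s): every character starts as "absent"
    for i, c in enumerate(P):       # O(m): later indices overwrite earlier ones
        L[c] = i
    return L


if __name__ == "__main__":
    print(last_occurrence("abacab", "abcd"))
    # {'a': 4, 'b': 5, 'c': 3, 'd': -1}
```

The two loops correspond to the two terms of the O(m + s) bound.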

The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← last-occurrence function of P over Σ
  i ← m − 1
  j ← m − 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i { match at i }
      else
        i ← i − 1
        j ← j − 1
    else
      { character-jump }
      l ← L[T[i]]
      i ← i + m − min(j, 1 + l)
      j ← m − 1
  until i > n − 1
  return −1 { no match }

Case 1: j ≤ 1 + l, so i advances by m − j
Case 2: 1 + l ≤ j, so i advances by m − (1 + l)
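The algorithm above can be sketched in Python as follows. Some details are assumptions of this sketch rather than part of the slides: the name `boyer_moore_match`, a dictionary standing in for the array L (characters absent from P map to −1 through `dict.get`), a `while` loop in place of repeat/until, and an explicit guard for the empty pattern.

```python
def boyer_moore_match(T: str, P: str) -> int:
    """Return the index of a substring of T equal to P, or -1 (character-jump heuristic only)."""
    n, m = len(T), len(P)
    if m == 0:
        return 0                                 # guard added here: empty pattern matches at 0
    last = {c: k for k, c in enumerate(P)}       # last-occurrence function L
    i = j = m - 1                                # start comparing at the end of the pattern
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i                         # match at i
            i -= 1                               # looking-glass: move backwards
            j -= 1
        else:
            l = last.get(T[i], -1)               # character-jump
            i += m - min(j, 1 + l)               # Case 1 uses j, Case 2 uses 1 + l
            j = m - 1
    return -1                                    # no match


if __name__ == "__main__":
    T = "a pattern matching algorithm"
    print(boyer_moore_match(T, "rithm") == T.find("rithm"))  # True
```

Taking the minimum of j and 1 + l ensures the pattern always moves forward by at least one position, even when the last occurrence of the mismatched character lies to the right of position j.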

[Figure: Boyer-Moore example with the pattern a b a c a b; comparisons numbered 1 to 13]

Boyer-Moore’s algorithm runs in time O(nm + s).

Example of worst case:
  - T = aaa … a
  - P = baaa

The worst case may occur in images and DNA sequences but is unlikely in English text.

Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text.

[Figure: worst-case example; comparisons numbered 1 to 24]

The KMP Algorithm - Motivation

Knuth-Morris-Pratt’s algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.

When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?

Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

[Figure: the pattern a b a a b a mismatching the text character x at pattern position j]

KMP Failure Function

Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].

Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1).

  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3

The KMP Algorithm

The failure function can be represented by an array and can be computed in O(m) time.

At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)

Hence, there are no more than 2n iterations of the while-loop.

Thus, KMP’s algorithm runs in O(m + n) time.

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m − 1
        return i − j { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j − 1]
      else
        i ← i + 1
  return −1 { no match }
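A possible Python rendering of the matching loop. To keep the sketch focused on the loop itself, it takes the failure function as an already computed list `F`; that parameter, the name `kmp_match`, and the empty-pattern guard are assumptions of this sketch. The usage example passes the table for P = abaaba given earlier (0 0 1 1 2 3).

```python
def kmp_match(T: str, P: str, F: list) -> int:
    """Return the start index of the first occurrence of P in T, or -1.

    F is the failure function of P, given as a list of length len(P).
    """
    n, m = len(T), len(P)
    if m == 0:
        return 0                    # guard added here: empty pattern matches at 0
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j        # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]            # reuse the prefix that is known to match
        else:
            i += 1                  # nothing matched, advance in the text
    return -1                       # no match


if __name__ == "__main__":
    F = [0, 0, 1, 1, 2, 3]          # failure function of "abaaba"
    print(kmp_match("abaabxabaaba", "abaaba", F))  # 6
```

Note that i never decreases: after a mismatch only j is reset, so no text character is re-read from an earlier position.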

Computing the Failure Function

The failure function can be represented by an array and can be computed in O(m) time.

The construction is similar to the KMP algorithm itself.

At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i − j increases by at least one (observe that F(j − 1) < j)

Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j − 1]
    else
      F[i] ← 0 { no match }
      i ← i + 1
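The failure function computation can be sketched in Python as below. The function name `failure_function` and the choice to return a list are assumptions; the loop follows the pseudocode step by step, matching the pattern against itself.

```python
def failure_function(P: str) -> list:
    """Return F, where F[j] is the size of the largest prefix of P[0..j]
    that is also a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1        # we have matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]        # use the failure function to shift P
        else:
            F[i] = 0            # no match
            i += 1
    return F


if __name__ == "__main__":
    print(failure_function("abaaba"))  # [0, 0, 1, 1, 2, 3]
```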

[Figure: KMP example with the pattern a b a c a b; comparisons numbered 1 to 19]

  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2

Rabin-Karp Algorithm

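We can view a string of k consecutive characters as representing a length-k decimal number.

Let p denote the decimal value of the pattern P[1..m], and let t_s denote the decimal value of the length-m substring T[s+1..s+m] of the text T[1..n], for s = 0, 1, …, n − m.
  - t_s = p if and only if T[s+1..s+m] = P[1..m]

By Horner's rule:
p = P[m] + 10(P[m−1] + 10(P[m−2] + … + 10(P[2] + 10 P[1]) … ))

We can compute p in O(m) time.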

Example

6378 = 8 + 10 (7 + 10 (3 + 10(6)))

= 8 + 70 + 300 + 6000
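The nested evaluation in this example is Horner's rule, which needs one multiplication and one addition per digit. A minimal sketch, assuming the string consists of decimal digits; the name `horner_value` and the radix parameter `d` are illustrative.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string as a radix-d number, most significant digit first."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)   # shift what we have by one digit, add the next
    return value


if __name__ == "__main__":
    print(horner_value("6378"))  # 6378
```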


Computing t_s

t_{s+1} can be computed from t_s in constant time:

t_{s+1} = 10(t_s − 10^{m−1} T[s+1]) + T[s+m+1]

Example: T = 314152

Thus p and t_0, t_1, …, t_{n−m} can all be computed in O(n + m) time, and all occurrences of the pattern P[1..m] in the text T[1..n] can be found in time O(n + m).

However, p and t_s may be too large to work with conveniently.
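The constant-time update can be checked on the text 314152. The slide does not state the window length for this example, so m = 5 below is an assumption, as is the function name `window_values`. Python strings are 0-based, so the digit written T[s+1] on the slide is `T[s]` in the code, and T[s+m+1] is `T[s + m]`.

```python
def window_values(T: str, m: int) -> list:
    """Return [t_0, t_1, ..., t_{n-m}] for a decimal digit string T, using the rolling update."""
    n = len(T)
    if not 1 <= m <= n:
        raise ValueError("window length m must satisfy 1 <= m <= len(T)")
    high = 10 ** (m - 1)                 # weight of the leading digit of a window
    t = int(T[:m])                       # t_0, computed directly
    values = [t]
    for s in range(n - m):
        # drop the leading digit, shift left by one digit, append the new last digit
        t = 10 * (t - high * int(T[s])) + int(T[s + m])
        values.append(t)
    return values


if __name__ == "__main__":
    print(window_values("314152", 5))  # [31415, 14152]
```

Here 14152 = 10(31415 − 10000 · 3) + 2, which is one application of the update.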


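Computation of p and t_0 using modular arithmetic

To keep the numbers small, p and the t_s are computed modulo a prime q, chosen so that each intermediate value fits within a computer word.

The recurrence equation can be rewritten as

t_{s+1} = (d(t_s − T[s+1] h) + T[s+m+1]) mod q

where d is the radix and h = d^{m−1} mod q is the value of the digit “1” in the high order position of an m-digit text window.

If t_s ≠ p (mod q), then s is an invalid shift, so comparing the values modulo q is a fast test that rules out invalid shifts. If t_s = p (mod q), the shift must still be checked by comparing P with the text window.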

Example

d = 10, alphabet = {0, …, 9}
T = 31415, P = 26, n = 5, m = 2, q = 11

We have:
p = 26 mod 11 = 4
t_0 = 31 mod 11 = 9
t_1 = (10(9 − (3 · 10 mod 11)) + 4) mod 11
    = (10(9 − 8) + 4) mod 11 = 14 mod 11 = 3
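The numbers in this example can be reproduced with a few lines of Python. The helper names `next_window_hash` and `worked_example` are hypothetical. Python's `%` operator returns a non-negative result for a positive modulus, so the intermediate value may go negative without an extra correction step.

```python
def next_window_hash(t_s: int, leaving: int, entering: int, d: int, h: int, q: int) -> int:
    """One step of the modular update: remove the leading digit, append the new digit."""
    return (d * (t_s - leaving * h) + entering) % q


def worked_example():
    d, q = 10, 11
    T, P = "31415", "26"
    m = len(P)
    h = pow(d, m - 1, q)          # h = d^(m-1) mod q, here 10
    p = int(P) % q                # 26 mod 11
    t0 = int(T[:m]) % q           # 31 mod 11
    # digit 3 leaves the window, digit 4 enters it
    t1 = next_window_hash(t0, int(T[0]), int(T[m]), d, h, q)
    return p, t0, t1


if __name__ == "__main__":
    print(worked_example())  # (4, 9, 3)
```

Since t_0 = 9 and t_1 = 3 both differ from p = 4, shifts 0 and 1 are ruled out without comparing any characters.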


Rabin-Karp Implementation

Input: text T, pattern P, radix d (which is typically |Σ|), and the prime q.
Output: valid shifts s where P matches

n ← length[T]; m ← length[P];
h ← d^{m−1} mod q; p ← 0; t_0 ← 0;
for i ← 1 to m {
  p ← (d×p + P[i]) mod q;
  t_0 ← (d×t_0 + T[i]) mod q;
}
for s ← 0 to n − m {
  if (p = t_s)
    if (P[1..m] = T[s+1..s+m])
      report s as a valid shift;
  if (s < n − m)
    t_{s+1} ← (d(t_s − T[s+1]×h) + T[s+m+1]) mod q;
}
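A compact Python sketch of the procedure, under some stated assumptions: the text and pattern are strings of decimal digits (each character is converted with `int`), the function name `rabin_karp_matches` is illustrative, the defaults d = 10 and q = 11 are taken from the worked example, and shifts are reported 0-based in a list.

```python
def rabin_karp_matches(T: str, P: str, d: int = 10, q: int = 11) -> list:
    """Return all valid shifts s (0-based) at which P occurs in the digit string T."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []                          # guard added here for degenerate inputs
    h = pow(d, m - 1, q)                   # value of a "1" in the high-order position
    p = t = 0
    for i in range(m):                     # hash of the pattern and of the first window
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # equal hashes may be a spurious hit, so confirm by comparing the characters
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:                      # roll the window one position to the right
            t = (d * (t - int(T[s]) * h) + int(T[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp_matches("31415", "26"))  # []
    print(rabin_karp_matches("31415", "14"))  # [1]
```

A larger prime q makes spurious hits rarer, at the cost of larger intermediate values.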
