1 Strings and PatternMatching

Uploaded by

srianvesh.567

Pattern Matching

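(Title figure: the pattern P = "a b a c a b" is slid along a text T beginning "a b a c a a b"; the character comparisons are numbered 1 to 4 in the order they are made.)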

240-301 Comp. Eng. Lab III (Software), Pattern Matching 1


Overview

1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Boyer-Moore Algorithm
4. The Knuth-Morris-Pratt Algorithm
5. More Information


1. What is Pattern Matching?
 Definition:
– given a text string T and a pattern string P, find
the pattern inside the text
 T: “the rain in spain stays mainly on the plain”
 P: “n th”

 Applications:
– text editors, Web search engines (e.g. Google),
image analysis
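
As a quick illustration of the definition above, Java's built-in String.indexOf() already answers the question "where does P start in T?". The sketch below is only a baseline for comparison with the algorithms in the later slides; the class and method names (FindDemo, find) are made up for this example.

```java
class FindDemo {
    // Returns the index where pattern starts in text, or -1 if it does not occur.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    public static void main(String[] args) {
        String t = "the rain in spain stays mainly on the plain";
        String p = "n th";
        // The only occurrence is the "n" of "on" followed by " th" of "the".
        System.out.println("Pattern starts at posn " + find(t, p));
    }
}
```

The return convention (start index, or -1) is the same one used by the Java methods in the rest of these slides.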



String Concepts
 Assume S is a string of size m.
 A substring S[i .. j] of S is the string fragment between indexes i and j (inclusive).
 A prefix of S is a substring S[0 .. i]
 A suffix of S is a substring S[i .. m-1]
– i is any index between 0 and m-1


Examples
S: a n d r e w
   0 1 2 3 4 5
 Substring S[1..3] == "ndr"
 All possible prefixes of S:
– "andrew", "andre", "andr", "and", "an", "a"
 All possible suffixes of S:
– "andrew", "ndrew", "drew", "rew", "ew", "w"
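
The examples above can be reproduced with a few lines of Java. This is a minimal sketch: the helper names (sub, prefixes, suffixes) are made up here, and the one thing to watch is that the slides use an inclusive end index while Java's substring() takes an exclusive one.

```java
import java.util.ArrayList;
import java.util.List;

class StringConcepts {
    // S[i .. j] with both ends inclusive, as on the slides.
    // Java's substring() takes an exclusive end index, hence j + 1.
    static String sub(String s, int i, int j) {
        return s.substring(i, j + 1);
    }

    // All prefixes S[0 .. i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(sub(s, 0, i));
        return out;
    }

    // All suffixes S[i .. m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(sub(s, i, s.length() - 1));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(sub(s, 1, 3));   // ndr
        System.out.println(prefixes(s));    // [andrew, andre, andr, and, an, a]
        System.out.println(suffixes(s));    // [andrew, ndrew, drew, rew, ew, w]
    }
}
```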



2. The Brute Force Algorithm
 Check each position in the text T to see if
the pattern P starts in that position

T: a n d r e w        T: a n d r e w
P: r e w              P:   r e w
P moves 1 char at a time through T
....

Brute Force in Java
Return index where pattern starts, or -1

public static int brute(String text, String pattern)
{
  int n = text.length();      // n is length of text
  int m = pattern.length();   // m is length of pattern
  int j;
  for (int i = 0; i <= (n-m); i++) {
    j = 0;
    while ((j < m) &&
           (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;      // no match
} // end of brute()



Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BruteSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = brute(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}

Analysis
 Brute force pattern matching runs in time O(mn) in the worst case.
 But most searches of ordinary text take O(m+n), which is very quick.


 The brute force algorithm is fast when the alphabet of the text is large
– e.g. A..Z, a..z, 0..9, etc.
 It is slower when the alphabet is small
– e.g. 0, 1 (as in binary files, image files, etc.)


 Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"
 Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
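
To see the gap between the two cases above, the brute force loop can be instrumented to count character comparisons. This is a sketch rather than part of the original code: the class BruteCount and its comparisons counter are added for illustration, the worst-case text is assumed to be 26 a's followed by h, and String.repeat() needs Java 11 or later.

```java
class BruteCount {
    static int comparisons;   // character comparisons made by the last search

    // Same search as brute(), but every charAt comparison is counted.
    static int brute(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;               // mismatch: try the next position
                j++;
            }
            if (j == m)
                return i;                // match at i
        }
        return -1;                       // no match
    }

    public static void main(String[] args) {
        String worstText = "a".repeat(26) + "h";   // assumed length of the worst-case text
        int posn = brute(worstText, "aaah");
        System.out.println("worst case: posn " + posn + ", " + comparisons + " comparisons");

        posn = brute("a string searching example is standard", "store");
        System.out.println("average case: posn " + posn + ", " + comparisons + " comparisons");
    }
}
```

In the worst case every one of the 24 candidate positions costs all 4 comparisons, which is the O(mn) behaviour; in the ordinary-text case most positions are rejected after a single comparison, so the total stays close to n.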



3. The Boyer-Moore Algorithm
 The Boyer-Moore pattern matching algorithm is based on two techniques.
 1. The looking-glass technique
– find P in T by moving backwards through P, starting at its end


 2. The character-jump technique
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the same as T[i]

(Figure: text T with the mismatched character x at index i, and pattern P with index j at the compared position.)

There are 3 possible cases, tried in order.

Case 1
 If P contains x somewhere, then try to
shift P right to align the last occurrence
of x in P with T[i].

(Figure: before the shift, x in T[i] is compared with P[j]; after the shift, the last x in P is aligned with T[i]. Then move i and j right, so that j is at the end of P: inew, jnew.)

Case 2
 If P contains x somewhere, but a shift right
to the last occurrence is not possible, then
shift P right by 1 character to T[i+1].

(Figure: the last occurrence of x in P is after position j, so P is shifted right by 1 character. Then move i and j right, so that j is at the end of P: inew, jnew.)

Case 3
 If cases 1 and 2 do not apply, then shift P to
align P[0] with T[i+1].

(Figure: there is no x in P, so P[0] is aligned with T[i+1]. Then move i and j right, so that j is at the end of P: inew, jnew.)
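
The three cases can be folded into one expression, i = i + m - min(j, 1 + lo), where lo is the index of the last occurrence of x in P (or -1). This is the update used in the Boyer-Moore Java code in these slides. The sketch below only checks the arithmetic; the class CharJump and the helper shift() are names introduced for this example.

```java
class CharJump {
    // New text index i after a mismatch of T[i] == x with P[j].
    // m is the pattern length; lo is the last index of x in P, or -1 if x is not in P.
    static int nextI(int i, int j, int m, int lo) {
        return i + m - Math.min(j, 1 + lo);
    }

    // How many characters the start of P moves right.
    // Old start is i - j; new start is nextI - (m - 1).
    static int shift(int j, int lo) {
        return j + 1 - Math.min(j, 1 + lo);
    }

    public static void main(String[] args) {
        // P = "abacab": last occurrences are a=4, b=5, c=3, d=-1
        System.out.println(shift(5, 3));    // Case 1: x = c, lo < j, shift by j - lo = 2
        System.out.println(shift(3, 4));    // Case 2: x = a, lo > j, shift by 1
        System.out.println(shift(5, -1));   // Case 3: x = d, not in P, shift by j + 1 = 6
    }
}
```

In each case j is then reset to m - 1, so the next comparison starts again at the end of P.
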
Boyer-Moore Example (1)

T: "a pattern matching algorithm"
P: "rithm"

(Figure: P is shown at 7 successive alignments under T; the 11 character comparisons are numbered in the order they are made, with comparisons 7 to 11 running right-to-left over the final, matching alignment.)


Last Occurrence Function
 Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L()
– L() maps all the letters in A to integers
 L(x) is defined as: // x is a letter in A
– the largest index i such that P[i] == x, or
– -1 if no such index exists

L() Example
 A = {a, b, c, d}
 P: "abacab"

index  0 1 2 3 4 5
P      a b a c a b

x     a  b  c  d
L(x)  4  5  3  -1

L() stores indexes into P[]
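
The table above can be rebuilt directly from the definition of L(). The sketch below uses a map keyed by character instead of an array so that the alphabet can be passed in explicitly; the class name LastOccurrence and the alphabet parameter are choices made for this example, and it assumes every pattern character belongs to the alphabet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LastOccurrence {
    // Builds L(x) for every letter x of the alphabet:
    // the largest index i with P[i] == x, or -1 if x does not occur in P.
    static Map<Character, Integer> build(String pattern, String alphabet) {
        Map<Character, Integer> last = new LinkedHashMap<>();
        for (char x : alphabet.toCharArray())
            last.put(x, -1);                     // default: letter not in P
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i);      // later indexes overwrite earlier ones
        return last;
    }

    public static void main(String[] args) {
        System.out.println(build("abacab", "abcd"));   // {a=4, b=5, c=3, d=-1}
    }
}
```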



Note

 In Boyer-Moore code, L() is calculated when the pattern P is read in.
 Usually L() is stored as an array
– something like the table in the previous slide


Boyer-Moore Example (2)
T: a b a c a a b a d c a b a c a b a a b b
P: a b a c a b

x     a  b  c  d
L(x)  4  5  3  -1

(Figure: P is shown at 6 successive alignments under T; the 13 character comparisons are numbered in the order they are made, with comparisons 8 to 13 running right-to-left over the final, matching alignment.)

Boyer-Moore in Java
Return index where pattern starts, or -1

public static int bmMatch(String text, String pattern)
{
  int last[] = buildLast(pattern);
  int n = text.length();
  int m = pattern.length();
  int i = m-1;

  if (m == 0)
    return 0;       // empty pattern matches at index 0, as in brute()
  if (i > n-1)
    return -1;      // no match if pattern is longer than text

  int j = m-1;
  do {
    if (pattern.charAt(j) == text.charAt(i))
      if (j == 0)
        return i;   // match
      else {        // looking-glass technique
        i--;
        j--;
      }
    else {          // character jump technique
      int lo = last[text.charAt(i)];   // last occ; assumes ASCII text
      i = i + m - Math.min(j, 1+lo);
      j = m - 1;
    }
  } while (i <= n-1);

  return -1;        // no match
} // end of bmMatch()

public static int[] buildLast(String pattern)
/* Return array storing index of last
   occurrence of each ASCII char in pattern. */
{
  int last[] = new int[128];   // ASCII char set

  for (int i = 0; i < 128; i++)
    last[i] = -1;              // initialize array

  for (int i = 0; i < pattern.length(); i++)
    last[pattern.charAt(i)] = i;

  return last;
} // end of buildLast()


Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java BmSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = bmMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}

Analysis
 Boyer-Moore worst case running time is O(nm + |A|), where |A| is the size of the alphabet A
 But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small.
– e.g. good for English text, poor for binary
 Boyer-Moore is significantly faster than brute force for searching English text.

Worst Case Example
 T: "aaaaa…a"
 P: "baaaaa"

(Figure: with a text of 9 a's, P is tried at 4 successive alignments. At each one, 6 comparisons are made right-to-left before the mismatch at b, giving 24 comparisons in total; P then shifts by only 1 character.)
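
The worst case above can be checked by counting comparisons in a Boyer-Moore search. The sketch below follows the same looking-glass and character-jump steps as the slides, with two deliberate differences that are choices made for this example: the last-occurrence table is a HashMap rather than a 128-entry array (so non-ASCII text does not overflow it), and a static comparisons counter is added. The class name BmCount is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BmCount {
    static int comparisons;   // character comparisons made by the last search

    static int bmMatch(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;     // empty pattern matches at index 0
        if (m > n) return -1;     // pattern longer than text

        // Last occurrence of each pattern character; absent characters count as -1.
        Map<Character, Integer> last = new HashMap<>();
        for (int k = 0; k < m; k++)
            last.put(pattern.charAt(k), k);

        int i = m - 1;
        int j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0) return i;            // whole pattern matched
                i--;                             // looking-glass: move backwards
                j--;
            } else {                             // character jump
                int lo = last.getOrDefault(text.charAt(i), -1);
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int posn = bmMatch("aaaaaaaaa", "baaaaa");
        System.out.println("posn " + posn + ", " + comparisons + " comparisons");
    }
}
```

With 9 a's this reports no match after 24 comparisons, the same count as the figure: every alignment costs all m comparisons and the shift is only 1, which is where the O(nm) term comes from.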



4. The KMP Algorithm

 The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to-right order (like the brute force algorithm).
 But it shifts the pattern more intelligently than the brute force algorithm.


 If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons?
 Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1]


Example

(Figure: a mismatch between T[i] and P[j] at j = 5; the pattern is shifted so that the comparison resumes at jnew = 2.)


Why
j == 5
 Find largest prefix (start) of:
"a b a a b" ( P[0 .. j-1] )
which is suffix (end) of:
"b a a b" ( P[1 .. j-1] )
 Answer: "a b"
 Set j = 2 // the new j value

KMP Failure Function
 KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
 j = mismatch position in P[]
 k = position before the mismatch (k = j-1).
 The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].



Failure Function Example
 P: "abaaba"

j     0 1 2 3 4 5
P[j]  a b a a b a

k     0 1 2 3 4      (k == j-1)
F(k)  0 0 1 1 2

F(k) is the size of the largest prefix.
 In code, F() is represented by an array, like the table.


Why is F(4) == 2? P: "abaaba"

 F(4) means
– find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that is also a suffix of "baab"
= find the size of "ab"
= 2
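
The reasoning above can be turned into a direct, definition-based check of F(k). This sketch tries every candidate size from largest to smallest, so it is quadratic per entry and is meant only for verifying small tables by hand, not as a replacement for an efficient failure-function routine. The class name FailureByDefinition is made up for this example.

```java
class FailureByDefinition {
    // F(k): size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int f(String p, int k) {
        for (int len = k; len >= 1; len--) {                    // candidate sizes, largest first
            String prefix = p.substring(0, len);                // P[0 .. len-1]
            String suffix = p.substring(k - len + 1, k + 1);    // last len chars of P[1..k]
            if (prefix.equals(suffix))
                return len;
        }
        return 0;                                               // no prefix is also a suffix
    }

    public static void main(String[] args) {
        String p = "abaaba";
        for (int k = 0; k <= 4; k++)
            System.out.print(f(p, k) + " ");                    // 0 0 1 1 2
        System.out.println();
    }
}
```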



Using the Failure Function

 Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm.
– if a mismatch occurs at P[j] (i.e. P[j] != T[i]) and j > 0, then
    k = j-1;
    j = F(k); // obtain the new j
– if the mismatch occurs at j == 0, move on to the next text character (i++)


KMP in Java
Return index where pattern starts, or -1

public static int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();

  if (m == 0)
    return 0;   // empty pattern matches at index 0, as in brute()

  int fail[] = computeFail(pattern);

  int i = 0;
  int j = 0;

  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];
    else
      i++;
  }
  return -1;   // no match
} // end of kmpMatch()



Similar code to kmpMatch()

public static int[] computeFail(String pattern)
{
  int m = pattern.length();
  int fail[] = new int[m];   // all entries start at 0, so fail[0] == 0

  int j = 0;
  int i = 1;

  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {            // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
} // end of computeFail()


Usage
public static void main(String args[])
{
  if (args.length != 2) {
    System.out.println("Usage: java KmpSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);

  int posn = kmpMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}

Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

k     0 1 2 3 4
F(k)  0 0 1 0 1

(Figure: P is shown at 5 successive alignments under T; the 19 character comparisons are numbered left-to-right in the order they are made.)
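
The example above can be replayed in code by counting the text/pattern comparisons that KMP makes. The sketch below mirrors the structure of the KMP Java code in these slides; the class name KmpCount and the comparisons counter are additions for illustration, and only comparisons between text and pattern characters are counted (not those made while building the failure array).

```java
class KmpCount {
    static int comparisons;   // text/pattern comparisons made by the last search

    // fail[k] = size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];          // entries start at 0
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;          // j+1 characters match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];          // fall back to a shorter prefix
            } else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    static int kmpMatch(String text, String pattern) {
        comparisons = 0;
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        int[] fail = computeFail(pattern);
        int i = 0, j = 0;
        while (i < n) {
            comparisons++;
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1) return i - m + 1;   // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];                    // shift pattern, keep i
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int posn = kmpMatch("abacaabaccabacabaabb", "abacab");
        System.out.println("posn " + posn + ", " + comparisons + " comparisons");
    }
}
```

For this text and pattern the search makes 19 comparisons, as numbered in the figure, and i never decreases.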



Why is F(4) == 1? P: "abacab"

 F(4) means
– find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that is also a suffix of "baca"
= find the size of "a"
= 1


KMP Advantages

 KMP runs in optimal time: O(m+n)
– very fast
 The algorithm never needs to move backwards in the input text, T
– this makes the algorithm good for processing very large files that are read in from external devices or through a network stream
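
Because KMP never moves backwards in T, the text can be consumed one character at a time from a stream without buffering it. The sketch below shows one way this might look using java.io.Reader; the class KmpStream and its methods find() and findInString() are hypothetical names for this example, and the inner loop is a compact variant of the same failure-function fallback rather than a copy of the slides' code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class KmpStream {
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int i = 1, j = 0;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                i++;                       // fail[i] is already 0
            }
        }
        return fail;
    }

    // Returns the character offset in the stream where pattern starts, or -1.
    // Each character is read exactly once and never revisited.
    static long find(Reader in, String pattern) throws IOException {
        int m = pattern.length();
        if (m == 0) return 0;
        int[] fail = computeFail(pattern);
        long i = 0;                        // offset of the character just read
        int j = 0;                         // number of pattern characters matched so far
        int c;
        while ((c = in.read()) != -1) {
            while (j > 0 && pattern.charAt(j) != c)
                j = fail[j - 1];           // fall back inside the pattern only
            if (pattern.charAt(j) == c)
                j++;
            if (j == m)
                return i - m + 1;          // match ends at offset i
            i++;
        }
        return -1;
    }

    // Convenience wrapper for in-memory text.
    static long findInString(String text, String pattern) {
        try {
            return find(new StringReader(text), pattern);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findInString("abacaabaccabacabaabb", "abacab"));
    }
}
```

The same find() method would accept a reader wrapped around a file or a network socket, since it only ever calls read().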



KMP Disadvantages

 KMP doesn’t work so well as the size of the alphabet increases
– more chance of a mismatch (more possible mismatches)
– mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later