Pattern Matching
[Figure: the pattern P = "abacab" slid along a text T beginning "abacaab", with the character comparisons numbered in order.]
240-301 Comp. Eng. Lab III (Software), Pattern Matching 1
Overview
1. What is Pattern Matching?
2. The Brute Force Algorithm
3. The Boyer-Moore Algorithm
4. The Knuth-Morris-Pratt Algorithm
5. More Information
1. What is Pattern Matching?
Definition:
– given a text string T and a pattern string P, find
the pattern inside the text
T: “the rain in spain stays mainly on the plain”
P: “n th”
Applications:
– text editors, Web search engines (e.g. Google),
image analysis
String Concepts
Assume S is a string of size m.
A substring S[i .. j] of S is the string
fragment between indexes i and j.
A prefix of S is a substring S[0 .. i]
A suffix of S is a substring S[i .. m-1]
– i is any index between 0 and m-1
Examples
S: a n d r e w
   0 1 2 3 4 5
Substring S[1..3] == "ndr"
All possible prefixes of S:
– "andrew", "andre", "andr", "and", "an", "a"
All possible suffixes of S:
– "andrew", "ndrew", "drew", "rew", "ew", "w"
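The prefix and suffix lists above can be reproduced with a short sketch. The class and method names (StringConcepts, prefixes, suffixes) are made up for this illustration and are not part of the lab code. Note that Java's substring(begin, end) excludes the end index, so S[i..j] is written s.substring(i, j+1).

```java
import java.util.ArrayList;
import java.util.List;

class StringConcepts {
    // All prefixes S[0..i], longest first.
    static List<String> prefixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = s.length() - 1; i >= 0; i--)
            out.add(s.substring(0, i + 1));   // end index is exclusive
        return out;
    }

    // All suffixes S[i..m-1], longest first.
    static List<String> suffixes(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(i));
        return out;
    }

    public static void main(String[] args) {
        String s = "andrew";
        System.out.println(s.substring(1, 4));  // S[1..3], prints ndr
        System.out.println(prefixes(s));
        System.out.println(suffixes(s));
    }
}
```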
2. The Brute Force Algorithm
Check each position in the text T to see if
the pattern P starts in that position
T: a n d r e w
P: r e w              (first position tried, index 0)
....
T: a n d r e w
P:       r e w        (match found at index 3)
P moves 1 char at a time through T.
Brute Force in Java
Returns the index where the pattern starts, or -1.
public static int brute(String text, String pattern)
{
  int n = text.length();     // n is length of text
  int m = pattern.length();  // m is length of pattern
  int j;
  for (int i = 0; i <= (n-m); i++) {
    j = 0;
    while ((j < m) &&
           (text.charAt(i+j) == pattern.charAt(j)))
      j++;
    if (j == m)
      return i;   // match at i
  }
  return -1;   // no match
} // end of brute()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    // the message must stay on one line: a Java string literal cannot span lines
    System.out.println("Usage: java BruteSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = brute(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
Brute force pattern matching runs in time
O(mn) in the worst case.
But most searches of ordinary text take
O(m+n), which is very quick.
The brute force algorithm is fast when the
alphabet of the text is large
– e.g. A..Z, a..z, 0..9, etc.
It is slower when the alphabet is small
– e.g. 0, 1 (as in binary files, image files, etc.)
Example of a worst case:
– T: "aaaaaaaaaaaaaaaaaaaaaaaaaah"
– P: "aaah"
Example of a more average case:
– T: "a string searching example is standard"
– P: "store"
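To make the difference between the two cases concrete, the sketch below runs the same loop as brute() but counts character comparisons instead of returning a position. The class and method names (BruteCount, countComparisons) are made up for this illustration. For a worst-case text of n-1 'a's followed by "h" and the pattern "aaah", every one of the n-m+1 start positions costs m comparisons; for the ordinary-text example most positions cost only 1.

```java
class BruteCount {
    // Same search as brute(), but returns the number of character
    // comparisons made up to the first match (or the end of the text).
    static int countComparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            int j = 0;
            while (j < m) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j))
                    break;          // mismatch: try the next start position
                j++;
            }
            if (j == m)
                break;              // first match found, stop counting
        }
        return comparisons;
    }

    public static void main(String[] args) {
        String worstT = "aaaaaaaaaaaaaaaaaaaaaaaaaah";
        String averageT = "a string searching example is standard";
        System.out.println("worst case:   " + countComparisons(worstT, "aaah")
                           + " comparisons, n = " + worstT.length());
        System.out.println("average case: " + countComparisons(averageT, "store")
                           + " comparisons, n = " + averageT.length());
    }
}
```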
3. The Boyer-Moore Algorithm
The Boyer-Moore pattern matching
algorithm is based on two techniques.
1. The looking-glass technique
– find P in T by moving backwards through P,
starting at its end
2. The character-jump technique
– when a mismatch occurs at T[i] == x
– the character in pattern P[j] is not the
same as T[i]
There are 3 possible cases, tried in order.
[Figure: T[i] == x is compared with P[j] and does not match.]
Case 1
If P contains x somewhere, then try to
shift P right to align the last occurrence
of x in P with T[i].
Then move i and j right, so that j is at the end of P.
[Figure: P shifted right so that its last x lines up with T[i]; inew and jnew mark the end of P.]
Case 2
If P contains x somewhere, but a shift right
to the last occurrence is not possible, then
shift P right by 1 character to T[i+1].
Here x is after position j in P.
Then move i and j right, so that j is at the end of P.
[Figure: P shifted right by 1 character; inew and jnew mark the end of P.]
Case 3
If cases 1 and 2 do not apply, then shift P to
align P[0] with T[i+1].
Here there is no x in P.
Then move i and j right, so that j is at the end of P.
[Figure: P shifted right so that P[0] lines up with T[i+1]; inew and jnew mark the end of P.]
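All three cases reduce to one formula for the new text index, the one used in bmMatch(): i = i + m - min(j, 1 + lo), where lo is the index of the last occurrence of x in P, or -1. The sketch below isolates that formula. The class and method names (CharacterJump, jump) are made up for this illustration.

```java
class CharacterJump {
    // New text index after a mismatch between T[i] and P[j].
    //   m  = length of P
    //   lo = index of the last occurrence of T[i] in P, or -1 if there is none
    // Case 1 (lo < j):   1+lo is the minimum, P's last x is aligned with T[i]
    // Case 2 (lo > j):   j is the minimum, P moves right by 1 character
    // Case 3 (lo == -1): the minimum is 0, P[0] is aligned with T[i+1]
    static int jump(int i, int j, int m, int lo) {
        return i + m - Math.min(j, 1 + lo);
    }

    public static void main(String[] args) {
        int m = 6;  // e.g. P = "abacab"
        System.out.println(jump(5, 5, m, 4));   // case 1, prints 6
        System.out.println(jump(4, 3, m, 4));   // case 2, prints 7
        System.out.println(jump(8, 5, m, -1));  // case 3, prints 14
    }
}
```

In every case the new i is the text position under the last character of the shifted pattern, which is why j is reset to m-1 afterwards.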
Boyer-Moore Example (1)
T: "a pattern matching algorithm"
P: "rithm"
[Figure: P slid along T in 7 alignments; the 11 character comparisons are numbered in the order they are made.]
Last Occurrence Function
Boyer-Moore’s algorithm preprocesses the
pattern P and the alphabet A to build a last
occurrence function L()
– L() maps all the letters in A to integers
L(x) is defined as (x is a letter in A):
– the largest index i such that P[i] == x, or
– -1 if no such index exists
L() Example
A = {a, b, c, d}
P: "abacab"

index  0  1  2  3  4  5
P      a  b  a  c  a  b

x      a  b  c  d
L(x)   4  5  3  -1
L() stores indexes into P[]
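The table can be checked with a small sketch. It uses a HashMap rather than an array, which is a choice made for this illustration only; the names LastOccurrence, buildLastMap and L are also made up here. A letter that never occurs in P has no map entry and is reported as -1.

```java
import java.util.HashMap;
import java.util.Map;

class LastOccurrence {
    // Map each character of the pattern to the largest index where it occurs.
    static Map<Character, Integer> buildLastMap(String pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++)
            last.put(pattern.charAt(i), i);   // later indexes overwrite earlier ones
        return last;
    }

    // L(x): last index of x in the pattern, or -1 if x does not occur.
    static int L(Map<Character, Integer> last, char x) {
        return last.getOrDefault(x, -1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> last = buildLastMap("abacab");
        for (char x : "abcd".toCharArray())
            System.out.println("L(" + x + ") = " + L(last, x));
    }
}
```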
Note
In Boyer-Moore code, L() is calculated
when the pattern P is read in.
Usually L() is stored as an array
– something like the table in the previous slide
Boyer-Moore Example (2)
T: a b a c a a b a d c a b a c a b a a b b
P: a b a c a b
[Figure: P slid along T in 6 alignments; the 13 character comparisons are numbered in the order they are made.]

x      a  b  c  d
L(x)   4  5  3  -1
Boyer-Moore in Java
Returns the index where the pattern starts, or -1.
public static int bmMatch(String text, String pattern)
{
  int last[] = buildLast(pattern);
  int n = text.length();
  int m = pattern.length();
  int i = m-1;
  if (i > n-1)
    return -1;   // no match if pattern is longer than text
  int j = m-1;
  do {
    if (pattern.charAt(j) == text.charAt(i))
      if (j == 0)
        return i;   // match
      else {   // looking-glass technique
        i--;
        j--;
      }
    else {   // character jump technique
      int lo = last[text.charAt(i)];   // last occ
      i = i + m - Math.min(j, 1+lo);
      j = m - 1;
    }
  } while (i <= n-1);
  return -1;   // no match
} // end of bmMatch()
public static int[] buildLast(String pattern)
/* Return array storing index of last
   occurrence of each ASCII char in pattern. */
{
  int last[] = new int[128];   // ASCII char set
  for (int i = 0; i < 128; i++)
    last[i] = -1;   // initialize array
  for (int i = 0; i < pattern.length(); i++)
    last[pattern.charAt(i)] = i;
  return last;
} // end of buildLast()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    // the message must stay on one line: a Java string literal cannot span lines
    System.out.println("Usage: java BmSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = bmMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Analysis
Boyer-Moore worst case running time is
O(nm + |A|), where |A| is the size of the alphabet A.
But Boyer-Moore is fast when the alphabet
is large, and slow when the alphabet is small.
– e.g. good for English text, poor for binary
Boyer-Moore is significantly faster than
brute force for searching English text.
Worst Case Example
T: "aaaaa…a"
P: "baaaaa"
[Figure: with a text of 9 a's, P is tried at 4 alignments; at each one all 6 characters are compared from right to left before the mismatch on 'b', giving 24 comparisons.]
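The comparison counts in the worst-case example and in Boyer-Moore Example (2) can be checked with an instrumented copy of bmMatch(). This is a sketch only: the names BmCount and bmCount are made up for this illustration, the counter is an addition, and the code assumes a non-empty pattern and ASCII text, as buildLast() does.

```java
class BmCount {
    // Index of the last occurrence of each ASCII char in pattern, or -1.
    static int[] buildLast(String pattern) {
        int[] last = new int[128];
        for (int i = 0; i < 128; i++)
            last[i] = -1;
        for (int i = 0; i < pattern.length(); i++)
            last[pattern.charAt(i)] = i;
        return last;
    }

    // Same search as bmMatch(), but returns { match index or -1, comparisons }.
    static int[] bmCount(String text, String pattern) {
        int[] last = buildLast(pattern);
        int n = text.length();
        int m = pattern.length();
        int comparisons = 0;
        int i = m - 1;
        if (i > n - 1)
            return new int[] { -1, 0 };   // pattern longer than text
        int j = m - 1;
        do {
            comparisons++;                // one comparison per loop iteration
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == 0)
                    return new int[] { i, comparisons };   // match
                i--;                      // looking-glass technique
                j--;
            } else {                      // character jump technique
                int lo = last[text.charAt(i)];
                i = i + m - Math.min(j, 1 + lo);
                j = m - 1;
            }
        } while (i <= n - 1);
        return new int[] { -1, comparisons };
    }

    public static void main(String[] args) {
        int[] worst = bmCount("aaaaaaaaa", "baaaaa");
        System.out.println(worst[0] + " after " + worst[1] + " comparisons");
        int[] ex2 = bmCount("abacaabadcabacabaabb", "abacab");
        System.out.println(ex2[0] + " after " + ex2[1] + " comparisons");
    }
}
```

On the 9-character worst-case text the search fails after 24 comparisons; on the text of Example (2) it finds the pattern at index 10 after 13 comparisons.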
4. The KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm
looks for the pattern in the text in a left-to-right
order (like the brute force algorithm).
But it shifts the pattern more intelligently
than the brute force algorithm.
If a mismatch occurs between the text and
pattern P at P[j], what is the most we can
shift the pattern to avoid wasteful
comparisons?
Answer: the largest prefix of P[0 .. j-1] that
is a suffix of P[1 .. j-1]
Example
[Figure: a mismatch between T[i] and P[j] at j = 5; P is shifted so that matching resumes at jnew = 2.]
Why
j == 5
Find largest prefix (start) of:
"a b a a b" ( P[0..j-1] )
which is a suffix (end) of:
"b a a b" ( P[1 .. j-1] )
Answer: "a b"
Set j = 2 // the new j value
KMP Failure Function
KMP preprocesses the pattern to find
matches of prefixes of the pattern with the
pattern itself.
j = mismatch position in P[]
k = position before the mismatch (k = j-1).
The failure function F(k) is defined as the
size of the largest prefix of P[0..k] that is
also a suffix of P[1..k].
Failure Function Example
P: "abaaba"

j     0  1  2  3  4  5
P[j]  a  b  a  a  b  a

k     0  1  2  3  4      (k == j-1)
F(k)  0  0  1  1  2

F(k) is the size of the largest prefix.
In code, F() is represented by an array, like
the table.
Why is F(4) == 2? P: "abaaba"
F(4) means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaab" that
is also a suffix of "baab"
= find the size of "ab"
=2
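The definition of F(k) can be turned into code directly by trying every candidate size, largest first. This is a deliberately slow sketch for checking values by hand, not the method KMP uses to build the table; the names NaiveFailure and f are made up for this illustration.

```java
class NaiveFailure {
    // F(k): size of the largest prefix of P[0..k] that is also a suffix of P[1..k].
    static int f(String p, int k) {
        for (int len = k; len >= 1; len--) {                  // largest size first
            String prefix = p.substring(0, len);              // P[0..len-1]
            String suffix = p.substring(k - len + 1, k + 1);  // P[k-len+1..k]
            if (prefix.equals(suffix))
                return len;
        }
        return 0;   // no prefix is also a suffix
    }

    public static void main(String[] args) {
        String p = "abaaba";
        for (int k = 0; k <= 4; k++)
            System.out.println("F(" + k + ") = " + f(p, k));
    }
}
```

Because len never exceeds k, the suffix always starts at index 1 or later, which matches the "suffix of P[1..k]" part of the definition.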
Using the Failure Function
Knuth-Morris-Pratt’s algorithm modifies
the brute-force algorithm.
– if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
k = j-1;
j = F(k); // obtain the new j
KMP in Java
Returns the index where the pattern starts, or -1.
public static int kmpMatch(String text, String pattern)
{
  int n = text.length();
  int m = pattern.length();
  int fail[] = computeFail(pattern);
  int i = 0;
  int j = 0;
  while (i < n) {
    if (pattern.charAt(j) == text.charAt(i)) {
      if (j == m - 1)
        return i - m + 1;   // match
      i++;
      j++;
    }
    else if (j > 0)
      j = fail[j-1];
    else
      i++;
  }
  return -1;   // no match
} // end of kmpMatch()
The code is similar to kmpMatch().
public static int[] computeFail(String pattern)
{
  int fail[] = new int[pattern.length()];
  fail[0] = 0;
  int m = pattern.length();
  int j = 0;
  int i = 1;
  while (i < m) {
    if (pattern.charAt(j) == pattern.charAt(i)) {   // j+1 chars match
      fail[i] = j + 1;
      i++;
      j++;
    }
    else if (j > 0)   // j follows matching prefix
      j = fail[j-1];
    else {   // no match
      fail[i] = 0;
      i++;
    }
  }
  return fail;
} // end of computeFail()
Usage
public static void main(String args[])
{
  if (args.length != 2) {
    // the message must stay on one line: a Java string literal cannot span lines
    System.out.println("Usage: java KmpSearch <text> <pattern>");
    System.exit(0);
  }
  System.out.println("Text: " + args[0]);
  System.out.println("Pattern: " + args[1]);
  int posn = kmpMatch(args[0], args[1]);
  if (posn == -1)
    System.out.println("Pattern not found");
  else
    System.out.println("Pattern starts at posn " + posn);
}
Example
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b
[Figure: P slid along T in 5 alignments; the 19 character comparisons are numbered in the order they are made.]

k      0  1  2  3  4
F(k)   0  0  1  0  1
Why is F(4) == 1? P: "abacab"
F(4) means
– find the size of the largest prefix of P[0..4] that
is also a suffix of P[1..4]
= find the size of the largest prefix of "abaca" that
is also a suffix of "baca"
= find the size of "a"
=1
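The 19 comparisons of the KMP example can be reproduced with an instrumented copy of kmpMatch(). This is a sketch: the names KmpCount and kmpCount and the comparison counter are additions for this illustration, and a non-empty pattern is assumed.

```java
class KmpCount {
    // Failure function table, built as in computeFail().
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];   // fail[0] is already 0
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;   // j+1 chars match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    // Same search as kmpMatch(), but returns { match index or -1, comparisons }.
    static int[] kmpCount(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[] fail = computeFail(pattern);
        int i = 0;
        int j = 0;
        int comparisons = 0;
        while (i < n) {
            comparisons++;                 // one comparison per loop iteration
            if (pattern.charAt(j) == text.charAt(i)) {
                if (j == m - 1)
                    return new int[] { i - m + 1, comparisons };   // match
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];           // shift the pattern, keep i
            } else {
                i++;
            }
        }
        return new int[] { -1, comparisons };
    }

    public static void main(String[] args) {
        int[] r = kmpCount("abacaabaccabacabaabb", "abacab");
        System.out.println("match at " + r[0] + " after " + r[1] + " comparisons");
    }
}
```

For the example text the pattern is found at index 10 after 19 comparisons, and i never decreases during the search.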
KMP Advantages
KMP runs in optimal time: O(m+n)
– very fast
The algorithm never needs to move
backwards in the input text, T
– this makes the algorithm good for processing
very large files that are read in from external
devices or through a network stream
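Because i only ever moves forward, the text does not have to be held in memory. The sketch below reads the text one character at a time from a java.io.Reader and applies the same failure function. It is an illustration under assumptions, not part of the lab code: the names KmpStream, kmpStream and find are made up here, and the pattern is assumed to be non-empty.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class KmpStream {
    // Failure function table, built as in computeFail().
    static int[] computeFail(String pattern) {
        int m = pattern.length();
        int[] fail = new int[m];
        int j = 0;
        int i = 1;
        while (i < m) {
            if (pattern.charAt(j) == pattern.charAt(i)) {
                fail[i] = j + 1;
                i++;
                j++;
            } else if (j > 0) {
                j = fail[j - 1];
            } else {
                fail[i] = 0;
                i++;
            }
        }
        return fail;
    }

    // Index (counted in chars read) where the pattern first starts, or -1.
    static long kmpStream(Reader in, String pattern) throws IOException {
        int m = pattern.length();
        int[] fail = computeFail(pattern);
        long i = 0;   // index of the character currently being examined
        int j = 0;    // number of pattern characters matched so far
        int c;
        while ((c = in.read()) != -1) {
            // On a mismatch only j falls back; the character c is never re-read.
            while (j > 0 && pattern.charAt(j) != (char) c)
                j = fail[j - 1];
            if (pattern.charAt(j) == (char) c)
                j++;
            if (j == m)
                return i - m + 1;   // match ends at index i
            i++;
        }
        return -1;   // end of stream, no match
    }

    // Convenience wrapper for in-memory text.
    static long find(String text, String pattern) {
        try {
            return kmpStream(new StringReader(text), pattern);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(find("the rain in spain stays mainly on the plain", "n th"));
    }
}
```

The same kmpStream() call would work with a FileReader or an InputStreamReader wrapped around a network stream, since it only ever calls read().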
KMP Disadvantages
KMP doesn’t work so well as the size of the
alphabet increases
– more chance of a mismatch (more possible
mismatches)
– mismatches tend to occur early in the pattern,
but KMP is faster when the mismatches occur
later