IS 7118 Unit-2 Regular Expressions
(Text Processing)
IS 7118: Natural Language Processing
1st Year, 2nd Semester, M.Sc(IS)
(Slides are adapted from the textbook by Jurafsky & Martin)
Instructor: Prof. Rama Krishna Rao Bandaru
Contents of Unit-2
• Regular Expressions
– Basic Regular Expression Patterns
– Disjunction, Grouping and Precedence
– A simple Example
– Advanced Operators
• Finite State Automata
– Use of FSA to Recognize Sheep-talk
– Formal Languages
– Non-Deterministic FSAs
– Use of NFSA to Accept Strings
– Recognition as Search
– Relation of Deterministic and Non-Deterministic Automata
• Regular Languages and FSAs
IS 7118: NLP Unit-2 RE & FSA, 2
Prof. R.K.Rao Bandaru
Regular Expression(RE)
• A regular expression (first developed by Kleene (1956)) is a formula
in a special language that is used for specifying simple classes of
strings.
• Formally, a regular expression is an algebraic notation for characterizing
a set of strings. Regular expressions can thus be used to specify search
strings as well as to define a language.
• It is a metalanguage for text searching. It requires a ‘pattern’ that we
want to search for and a ‘corpus’ of text to search through. It returns
all texts that match the pattern.
• REs
– Character sequence
– Kleene star
– Character set, complement set
– Anchors
– Disjunction
– Grouping
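The pattern-and-corpus idea above can be sketched in a few lines of Python using the standard `re` module. The three-line corpus and the helper name `search_corpus` are assumptions made up for this illustration; they do not come from the slides.

```python
import re

def search_corpus(pattern, corpus):
    """Return every line of the corpus that contains a match for pattern."""
    regex = re.compile(pattern)
    return [line for line in corpus if regex.search(line)]

# Hypothetical corpus, invented for this example.
corpus = [
    "Woodchucks are also called groundhogs.",
    "The sheep said baaa!",
    "Nothing relevant here.",
]

# The pattern combines a character set, a disjunction and a Kleene plus.
matches = search_corpus(r"[Ww]oodchuck|baa+!", corpus)
print(matches)
```

Only the first two lines are returned, because the third contains neither alternative of the pattern.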
Definition of a Regular Expression
• A regular expression is a linguistic formalism. Given an alphabet Σ, the set of
regular expressions over Σ is defined as follows:
1. If a is a letter in the alphabet Σ, then a is a regular expression;
2. ε, standing for empty string is a regular expression;
3. Ø, standing for the empty set is a regular expression;
4. If r1 and r2 are regular expressions, then so are (r1+r2) and (r1.r2); here
+ signifies union (sometimes | is used instead) and . signifies concatenation;
5. If r is a regular expression, then so is (r)* , which signifies closure
operation;
6. Nothing else is a regular expression over Σ.
– Note: This definition may seem circular, but 1-3 form the basis. Parentheses
have highest precedence, followed by *, concatenation, and then union.
• Example: Let Σ be the alphabet {a,b,c,…,y,z}. Some regular
expressions over this alphabet are:
Ø, a, ((c.a).t), (((m.e).(o)*).w), (a+(e+(i+(o+u)))), ((a+(e+(i+(o+u)))))*, etc.
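The inductive definition can be mirrored by one small constructor per rule. The sketch below translates the formal notation into Python `re` syntax; the function names (`lit`, `union`, `concat`, `star`, `accepts`) are hypothetical helpers chosen for this illustration, and the encoding of Ø as a pattern that never matches is an assumption of the sketch, not part of the definition.

```python
import re

EPSILON = ""          # rule 2: matches only the empty string
EMPTY_SET = r"(?!)"   # rule 3 (assumed encoding): a lookahead that always fails

def lit(a):
    """Rule 1: a single letter of the alphabet."""
    return re.escape(a)

def union(r1, r2):
    """Rule 4: (r1+r2), written with | in Python."""
    return f"(?:{r1}|{r2})"

def concat(r1, r2):
    """Rule 4: (r1.r2), written by juxtaposition in Python."""
    return f"(?:{r1}{r2})"

def star(r):
    """Rule 5: (r)*, the closure operation."""
    return f"(?:{r})*"

def accepts(r, s):
    """True if the whole string s belongs to the language of r."""
    return re.fullmatch(r, s) is not None

# The example expressions from the slide, built rule by rule.
cat = concat(concat(lit("c"), lit("a")), lit("t"))
meow = concat(concat(concat(lit("m"), lit("e")), star(lit("o"))), lit("w"))
vowel = union(lit("a"), union(lit("e"), union(lit("i"), union(lit("o"), lit("u")))))
vowels = star(vowel)

print(accepts(cat, "cat"), accepts(meow, "meooow"), accepts(vowels, "aeiou"))
```

Because every constructor wraps its result in a non-capturing group, the nesting of the Python pattern follows the parenthesization of the formal expression exactly.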
Regular expressions
• Ranges [A-Z]
Pattern   Matches                Example
[A-Z]     An upper case letter   Drenched Blossoms
[a-z]     A lower case letter    my beans were impatient
[0-9]     A single digit         Chapter 1: Down the Rabbit
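As a quick check of the range patterns, the following sketch runs each one against its example string with Python's `re.search`, which returns the first match. The variable names are illustrative only.

```python
import re

# (pattern, example string) pairs taken from the table of ranges.
examples = [
    (r"[A-Z]", "Drenched Blossoms"),
    (r"[a-z]", "my beans were impatient"),
    (r"[0-9]", "Chapter 1: Down the Rabbit"),
]

# re.search returns the leftmost match, so each entry is the first hit.
first_matches = [re.search(pattern, text).group() for pattern, text in examples]
print(first_matches)  # ['D', 'm', '1']
```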
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
– Caret means negation only when it is the first character inside []
Pattern   Matches                    Example
[^A-Z]    Not an upper case letter   Oyfn pripetchik
[^Ss]     Neither ‘S’ nor ‘s’        I have no exquisite reason
[^e^]     Neither e nor ^            Look here
[^\.]     Not a period               our resident Djinn
a^b       The pattern a caret b      Look up a^b now
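The negation patterns can be tried in Python as sketched below. One caveat applies to this sketch: in Python's `re` module a caret outside square brackets always acts as an anchor, so the last pattern has to be written `a\^b` to match a literal caret. Other regular expression dialects may treat `a^b` differently.

```python
import re

# (pattern, example string) pairs from the negation table.
examples = [
    (r"[^A-Z]", "Oyfn pripetchik"),             # first non-upper-case character
    (r"[^Ss]", "I have no exquisite reason"),   # first character that is not S or s
    (r"[^e^]", "Look here"),                    # second ^ is a literal caret
    (r"[^\.]", "our resident Djinn"),           # first character that is not a period
    (r"a\^b", "Look up a^b now"),               # escaped caret: literal in Python
]

first_matches = [re.search(pattern, text).group() for pattern, text in examples]
print(first_matches)  # ['y', 'I', 'L', 'o', 'a^b']

# Unescaped, the caret is an anchor in Python, so nothing can match.
unescaped = re.search(r"a^b", "Look up a^b now")
print(unescaped)  # None
```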
Regular Expressions: Disjunction ‘|’
Pattern                      Matches
groundhog|woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck
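A short sketch of the pipe operator in Python follows. The sample sentence and the word "cabbage" are made up for this example.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Groundhog is another name for woodchuck; a groundhog is a Woodchuck."

# Lower-case alternatives only.
lower_only = re.findall(r"groundhog|woodchuck", text)
# Character sets allow either case for the first letter.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

print(lower_only)   # ['woodchuck', 'groundhog']
print(either_case)  # ['Groundhog', 'woodchuck', 'groundhog', 'Woodchuck']

# a|b|c and [abc] describe the same set of single characters.
same = re.findall(r"a|b|c", "cabbage") == re.findall(r"[abc]", "cabbage")
print(same)  # True
```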
Regular Expressions: ? * + .
Pattern   Matches                      Examples
oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
Kleene *, Kleene +
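The Kleene star and Kleene plus can be compared directly in Python. In this sketch `oo*h!` and `o+h!` are checked against the same candidate strings; the candidate list is an assumption of the example.

```python
import re

star = re.compile(r"oo*h!")  # one o, then zero or more further o's
plus = re.compile(r"o+h!")   # one or more o's

candidates = ["h!", "oh!", "ooh!", "ooooh!"]

# fullmatch requires the whole string to match the pattern.
star_hits = [s for s in candidates if star.fullmatch(s)]
plus_hits = [s for s in candidates if plus.fullmatch(s)]

print(star_hits)  # ['oh!', 'ooh!', 'ooooh!']
print(plus_hits)  # ['oh!', 'ooh!', 'ooooh!']
```

Both patterns accept the same strings, which shows that `o+` is shorthand for `oo*`.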
Operator Precedence Hierarchy
1. Parentheses ()
2. Counters * + ? {}
3. Sequences and anchors, e.g. the ^my end$
4. Disjunction |
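The effect of the precedence hierarchy can be seen on a few small patterns. The sketch below uses Python's `re.fullmatch`; the patterns and test strings are illustrative choices, not taken from the slide.

```python
import re

def full(pattern, string):
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

cases = [
    (r"the*", "theee"),       # * binds to the single char e, not to "the"
    (r"the*", "thethe"),      # so repeating the whole word fails
    (r"(the)*", "thethe"),    # parentheses make * apply to the sequence
    (r"the|any", "any"),      # | has lowest precedence: "the" or "any"
    (r"the|any", "theny"),    # not "th" followed by e|a followed by "ny"
    (r"th(e|a)ny", "theny"),  # parentheses restrict the disjunction
]

results = {case: full(*case) for case in cases}
for case, ok in results.items():
    print(case, ok)
```

The six checks print True, False, True, True, False, True in that order.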
• Anchors
Pattern      Matches
^[A-Z]       Palo Alto
^[^A-Za-z]   1 “Hello”
\.$          The end.
.$           The end? The end!
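The anchor patterns can be verified with the sketch below, which applies each pattern to its example string using Python's `re.search`. The variable names are illustrative.

```python
import re

# (pattern, example string) pairs from the anchors table.
examples = [
    (r"^[A-Z]", "Palo Alto"),       # upper case letter at the start
    (r"^[^A-Za-z]", "1 “Hello”"),   # non-letter at the start
    (r"\.$", "The end."),           # literal period at the end
    (r".$", "The end?"),            # any character at the end
    (r".$", "The end!"),
]

matched = [re.search(pattern, text).group() for pattern, text in examples]
print(matched)  # ['P', '1', '.', '?', '!']

# The escaped period does not match a question mark.
no_match = re.search(r"\.$", "The end?")
print(no_match)  # None
```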
/[^a-zA-Z][tT]he[^a-zA-Z]/
The recent attempt by the police to retain their
current rates of pay has not gathered much
favor with the southern factions.
• Still, this won’t find the word ‘the’ when it begins or ends a
line. Allowing the line boundaries as alternatives fixes this:
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
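The two patterns for finding the word ‘the’ can be compared on the sample sentence. This is a sketch using Python's `re.findall`; the second pattern is written with both alternatives parenthesized, and with non-capturing groups `(?:...)` so that `findall` returns whole matches, which is a Python-specific choice of this example.

```python
import re

sentence = ("The recent attempt by the police to retain their current rates "
            "of pay has not gathered much favor with the southern factions.")

# Requires a non-letter on both sides, so it misses "The" at the line start.
simple = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", sentence)

# Also accepts the start or end of the line as a boundary.
anchored = re.findall(r"(?:^|[^a-zA-Z])[tT]he(?:[^a-zA-Z]|$)", sentence)

print(simple)    # [' the ', ' the ']
print(anchored)  # ['The ', ' the ', ' the ']
```

Neither pattern matches inside "their", "gathered" or "southern", because a letter sits next to "the" in each of those words.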
RE Match
* zero or more occurrences of the previous char or expression
+ one or more occurrences of the previous char or expression
? exactly zero or one occurrences of the previous char or expression
{n} n occurrences of the previous char or expression
{n,m} from n to m occurrences of the previous char or expression
{n,} at least n occurrences of the previous char or expression
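The counters in the table can be exercised with a few whole-string checks. The patterns below (`a{3}`, `a{2,4}`, `a{2,}`, `colou?r`) are illustrative choices for this sketch.

```python
import re

def full(pattern, string):
    """True if the entire string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

cases = [
    (r"a{3}", "aaa"),        # exactly three a's
    (r"a{3}", "aaaa"),       # four is too many
    (r"a{2,4}", "aaaa"),     # between two and four
    (r"a{2,4}", "a"),        # one is too few
    (r"a{2,}", "aaaaaa"),    # at least two
    (r"colou?r", "color"),   # optional u absent
    (r"colou?r", "colour"),  # optional u present
]

results = [full(pattern, string) for pattern, string in cases]
print(results)  # [True, False, True, False, True, True, True]
```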
Input tape:      b   a   a   a   !
State sequence:  q0  q1  q2  q2  q3  q4
Tracing Execution of NDFA for Sheep-talk Example: Depth-First Search
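A depth-first trace of a non-deterministic sheep-talk automaton can be sketched as follows. The transition table is an assumption of this example: it encodes a five-state machine in which q2 has two outgoing arcs on a (a self-loop and an arc to q3). The function name `nd_recognize` and the use of a Python list as the agenda are likewise choices made for this sketch; using the list as a stack gives depth-first order.

```python
# Assumed transition table for a non-deterministic sheep-talk automaton.
TRANSITIONS = {
    ("q0", "b"): ["q1"],
    ("q1", "a"): ["q2"],
    ("q2", "a"): ["q2", "q3"],  # the non-deterministic choice point
    ("q3", "!"): ["q4"],
}
START = "q0"
ACCEPT = {"q4"}

def nd_recognize(tape):
    """Depth-first search over (tape index, state path) pairs.

    Returns the accepting state path, or None if the tape is rejected.
    """
    agenda = [(0, [START])]  # stack: the last pushed search state is explored first
    while agenda:
        index, path = agenda.pop()
        state = path[-1]
        if index == len(tape):
            if state in ACCEPT:
                return path
            continue  # end of tape in a non-accepting state: backtrack
        for next_state in TRANSITIONS.get((state, tape[index]), []):
            agenda.append((index + 1, path + [next_state]))
    return None

print(nd_recognize("baaa!"))  # ['q0', 'q1', 'q2', 'q2', 'q3', 'q4']
print(nd_recognize("ba!"))    # None
```

Replacing `agenda.pop()` with `agenda.pop(0)` would turn the stack into a queue and the search into breadth-first search, without changing which strings are accepted.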
Figure: Automata for the base case (no operators) of the induction showing
that a regular expression can be turned into an equivalent automaton.
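The three base cases of the induction (the empty string ε, the empty set Ø, and a single symbol a) can be sketched as tiny automata. The dictionary representation and the helper names `make_epsilon`, `make_empty_set`, `make_symbol` and `accepts` are assumptions of this example, not notation from the slides.

```python
def make_epsilon():
    """r = ε: the start state is also the final state."""
    return {"start": "q0", "finals": {"q0"}, "delta": {}}

def make_empty_set():
    """r = Ø: the final state exists but can never be reached."""
    return {"start": "q0", "finals": {"qf"}, "delta": {}}

def make_symbol(a):
    """r = a: a single arc labelled a from the start state to the final state."""
    return {"start": "q0", "finals": {"qf"}, "delta": {("q0", a): {"qf"}}}

def accepts(fsa, string):
    """Simulate the automaton by tracking the set of currently reachable states."""
    current = {fsa["start"]}
    for ch in string:
        current = set().union(*(fsa["delta"].get((q, ch), set()) for q in current))
    return bool(current & fsa["finals"])

print(accepts(make_epsilon(), ""))      # True
print(accepts(make_empty_set(), ""))    # False
print(accepts(make_symbol("a"), "a"))   # True
```

The inductive step would combine such automata for union, concatenation and closure; that step is not shown here.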