IS 7118 Unit-2 Regular Expressions

This document provides an overview of the contents of Unit 2 which covers regular expressions and finite state automata. It defines regular expressions as a formal language used to specify text strings and search patterns. It describes basic regular expression patterns including characters, character classes, anchors, disjunction, grouping and precedence. It also covers finite state automata and how they can be used to recognize languages and accept strings. Examples are provided to demonstrate regular expression patterns and operators.

Uploaded by

Jmpol John
Unit-2: Regular Expressions

(Text Processing)
IS 7118: Natural Language Processing
1st Year, 2nd Semester, M.Sc(IS)
(Slides are adapted from the textbook by Jurafsky & Martin)
Instructor: Prof. Rama Krishna Rao Bandaru
Contents of Unit-2
• Regular Expressions
– Basic Regular Expression Patterns
– Disjunction, Grouping and Precedence
– A simple Example
– Advanced Operators
• Finite State Automata
– Use of FSA to Recognize Sheep-talk
– Formal Languages
– Non-Deterministic FSAs
– Use of NFSA to Accept Strings
– Recognition as Search
– Relation of Deterministic and Non-Deterministic Automata
• Regular Languages and FSAs
IS 7118: NLP Unit-2 RE & FSA, 2
Prof. R.K.Rao Bandaru
Regular Expression(RE)
• A regular expression (first developed by Kleene (1956)) is a formula
in a special language that is used for specifying simple classes of
strings.
• Formally, a regular expression is an algebraic notation to characterize
a set of strings. Thus REs can be used to specify search strings as well
as to define a language in a formal way.
• It is a metalanguage for text searching. It requires a ‘pattern’ that we
want to search for and a ‘corpus’ of text to search through. It returns
all texts that match the pattern.
• REs
– Character sequence
– Kleene star
– Character set, complement set
– Anchors
– Disjunction
– Grouping

Definition of a Regular Expression
• RE is a linguistic formalism. Given an alphabet ∑, the set of
regular expressions over ∑ is defined as follows:
1. If a is a letter in the alphabet ∑, then a is a regular expression;
2. ε, standing for the empty string, is a regular expression;
3. Ø, standing for the empty set, is a regular expression;
4. If r1 and r2 are regular expressions, then so are (r1+r2) and (r1.r2); here
+ signifies union (sometimes | is used) and . signifies concatenation;
5. If r is a regular expression, then so is (r)*, which signifies the closure
operation;
6. Nothing else is a regular expression over ∑.
– Note: This definition may seem circular, but 1-3 form the basis. Parentheses
have highest precedence, followed by *, concatenation, and then union.
• Example: Let ∑ be the alphabet {a,b,c,…,y,z}. Some regular
expressions over this alphabet are:
Ø, a, ((c.a).t), (((m.e).(o)*).w), (a+(e+(i+(o+u)))), ((a+(e+(i+(o+u)))))*, etc.

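As a rough illustration, the formal examples above can be rewritten in the practical notation of a regex library, where concatenation is written by juxtaposition and union as |. The sketch below uses Python's re module; the variable names are ours and are not part of the slides.

```python
import re

# Formal notation -> Python pattern (concatenation is implicit, + becomes |)
cat = re.compile(r"cat")              # ((c.a).t)
meow = re.compile(r"meo*w")           # (((m.e).(o)*).w)
vowel = re.compile(r"a|e|i|o|u")      # (a+(e+(i+(o+u))))
vowels = re.compile(r"(a|e|i|o|u)*")  # ((a+(e+(i+(o+u)))))*

# fullmatch asks whether the whole string belongs to the language
print(bool(cat.fullmatch("cat")))        # True
print(bool(meow.fullmatch("mew")))       # True: (o)* allows zero o's
print(bool(meow.fullmatch("meooow")))    # True
print(bool(vowel.fullmatch("ae")))       # False: exactly one vowel
print(bool(vowels.fullmatch("aeiou")))   # True
print(bool(vowels.fullmatch("")))        # True: closure includes the empty string
```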
Regular expressions

• A formal language for specifying text strings


• How can we search for any of these?
– woodchuck
– woodchucks
– Woodchuck
– Woodchucks

Regular Expressions: Disjunctions
• Letters inside square brackets []
  Pattern         Matches
  [wW]oodchuck    Woodchuck, woodchuck
  [1234567890]    Any digit

• Ranges [A-Z]
  Pattern   Matches                Example
  [A-Z]     An upper case letter   Drenched Blossoms
  [a-z]     A lower case letter    my beans were impatient
  [0-9]     A single digit         Chapter 1: Down the Rabbit

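A minimal sketch of the bracket patterns above, using Python's re module (our choice of tool, not the slides'). re.search returns the first match in the string, which is the behaviour the example column of the table illustrates.

```python
import re

# A bracketed set matches any single character listed inside it
print(re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck"))
# ['Woodchuck', 'woodchuck']

# Ranges: the first character that falls inside the range is matched
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit").group()
print(first_upper, first_lower, first_digit)   # D m 1
```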
Regular Expressions: Negation in Disjunction

• Negations [^Ss]
– Caret means negation only when first in []
  Pattern   Matches                    Example
  [^A-Z]    Not an upper case letter   Oyfn pripetchik
  [^Ss]     Neither ‘S’ nor ‘s’        I have no exquisite reason”
  [^e^]     Neither e nor ^            Look here
  [^\.]     Not a period               our resident Djinn
  a^b       The pattern a caret b      Look up a^b now

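The negated sets can be tried out with Python's re module, as sketched below. One caveat that is specific to Python (and an assumption about the reader's tool, not something the slides state): outside square brackets Python always treats ^ as an anchor, so the literal pattern "a caret b" has to be written with a backslash.

```python
import re

# Inside brackets, a leading ^ negates the set
print(re.search(r"[^A-Z]", "Oyfn pripetchik").group())             # y
print(re.search(r"[^Ss]", "I have no exquisite reason").group())   # I
print(re.search(r"[^e^]", "Look here").group())                    # L
print(re.search(r"[^\.]", "our resident Djinn").group())           # o

# Outside brackets Python reads ^ as "start of string", so escape it
literal = re.search(r"a\^b", "Look up a^b now")
unescaped = re.search(r"a^b", "Look up a^b now")
print(literal.group(), unescaped)                                   # a^b None
```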
Regular Expression ‘?’

• The question mark ? means “zero or one instances of previous character”

  Pattern        Matches                   Example pattern matched
  woodchucks?    woodchuck or woodchucks   “woodchuck”
  colou?r        color or colour           “colour”

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction

  Pattern                      Matches
  groundhog|woodchuck
  yours|mine                   yours mine
  a|b|c                        = [abc]
  [gG]roundhog|[Ww]oodchuck

(Photo: D. Fletcher)

Regular Expressions: ? * + .
  Pattern   Meaning                      Matches
  oo*h!     0 or more of previous char   oh! ooh! oooh! ooooh!
  o+h!      1 or more of previous char   oh! ooh! oooh! ooooh!
  baa+                                   baa baaa baaaa baaaaa
  beg.n                                  begin begun beg3n

Kleene *, Kleene + (named after Stephen C Kleene)

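The operators ?, |, *, + and . from the preceding slides can be checked against small word lists. This is only a sketch: the helper matches() and the test words are ours, and re.fullmatch is used so that a word counts only if the whole word fits the pattern.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matches(r"woodchucks?", ["woodchuck", "woodchucks", "woodchuckss"]))
# ['woodchuck', 'woodchucks']
print(matches(r"colou?r", ["color", "colour", "colouur"]))
# ['color', 'colour']
print(matches(r"groundhog|woodchuck", ["groundhog", "woodchuck", "hedgehog"]))
# ['groundhog', 'woodchuck']
print(matches(r"oo*h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']
print(matches(r"o+h!", ["h!", "oh!", "ooh!", "ooooh!"]))
# ['oh!', 'ooh!', 'ooooh!']
print(matches(r"baa+", ["ba", "baa", "baaaaa"]))
# ['baa', 'baaaaa']
print(matches(r"beg.n", ["begin", "begun", "beg3n", "begn"]))
# ['begin', 'begun', 'beg3n']
```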
Operator Precedence Hierarchy
1. Parentheses ()
2. Counters * + ? {}
3. Sequences and anchors: the ^my end$
4. Disjunction |

Thus, because counters have a higher precedence than
sequences, /the*/ matches ‘theeeeee’ but not ‘thethe’.
Similarly, because sequences have higher precedence than
disjunction, /the|any/ matches ‘the’ or ‘any’ but not
‘theny’.

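The two precedence claims above can be verified directly. The sketch below uses Python's re.fullmatch; the parenthesized variants are our additions, included to show how grouping overrides the default precedence.

```python
import re

# Counters bind tighter than sequences: * applies only to the final 'e'
print(bool(re.fullmatch(r"the*", "theeeeee")))    # True
print(bool(re.fullmatch(r"the*", "thethe")))      # False
print(bool(re.fullmatch(r"(the)*", "thethe")))    # True: parentheses group 'the'

# Sequences bind tighter than disjunction: the choice is 'the' or 'any'
print(bool(re.fullmatch(r"the|any", "the")))      # True
print(bool(re.fullmatch(r"the|any", "theny")))    # False
print(bool(re.fullmatch(r"th(e|a)ny", "theny")))  # True: choice is only e or a
```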
Regular Expressions: Anchors ^ $

• Anchors

  Pattern       Matches
  ^[A-Z]        Palo Alto
  ^[^A-Za-z]    1 “Hello”
  \.$           The end.
  .$            The end? The end!

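A short sketch of the anchor patterns in Python's re module, run on the example strings from the table above. The variable names are ours.

```python
import re

# ^ anchors the match at the start of the string
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_nonletter = re.search(r"^[^A-Za-z]", '1 "Hello"').group()
print(starts_upper, starts_nonletter)              # P 1

# $ anchors the match at the end; \. is a literal period, . is any character
print(re.search(r"\.$", "The end.").group())       # .
print(re.search(r"\.$", "The end?"))               # None
print(re.search(r".$", "The end?").group())        # ?
print(re.search(r".$", "The end!").group())        # !
```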
A Simple Exercise
• Write a regular expression to find all instances
of the determiner “the”:

The recent attempt by the police to retain their
current rates of pay has not gathered much
favor with the southern factions.

• /the/ (misses the capitalized ‘The’)
• /[Tt]he/ (also matches inside other words such as ‘their’)
• /\b[Tt]he\b/
• But suppose we want to find ‘the’ in contexts where it has
underscores or numbers nearby, for example ‘the_’ or ‘the24’.
Since \b does not treat underscores and digits as word boundaries,
we instead require that no alphabetic letter appears on either side:
• /[^a-zA-Z][tT]he[^a-zA-Z]/
• Still, this won’t find the word ‘the’ when it begins a
line.
• /(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/

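The successive patterns from this exercise can be compared on the sample sentence. The sketch below uses Python's re module; the last pattern is written with non-capturing groups (?:...) so that re.findall returns whole matches, which is a Python convenience and not part of the slides.

```python
import re

text = ("The recent attempt by the police to retain their current rates "
        "of pay has not gathered much favor with the southern factions.")

patterns = [
    r"the",                                   # misses "The"
    r"[Tt]he",                                # also hits their, gathered, southern
    r"\b[Tt]he\b",                            # word boundaries on both sides
    r"[^a-zA-Z][tT]he[^a-zA-Z]",              # misses "The" at the start of the line
    r"(?:^|[^a-zA-Z])[tT]he(?:[^a-zA-Z]|$)",  # allows start and end of line
]

counts = [len(re.findall(p, text)) for p in patterns]
for p, n in zip(patterns, counts):
    print(f"{p:42} {n} match(es)")
# The sentence contains the determiner three times; counts are 5, 6, 3, 2, 3.
```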
Errors
• The process we just went through was based on fixing
two kinds of errors
– Matching strings that we should not have matched (there, then, other)
• False positives (Type I)
– Not matching things that we should have matched (The)
• False negatives (Type II)

Errors cont.
• In NLP we are always dealing with these kinds of
errors.
• Reducing the error rate for an application often
involves two antagonistic efforts:
– Increasing accuracy or precision (minimizing false
positives)
– Increasing coverage or recall (minimizing false negatives).
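To make the trade-off concrete, precision and recall can be computed for two of the patterns from the exercise. This is a sketch under a stated assumption: the "gold" answer set is taken to be the matches of /\b[Tt]he\b/ on the sample sentence, standing in for a hand-labelled list of determiners. The function name precision_recall is ours.

```python
import re

text = ("The recent attempt by the police to retain their current rates "
        "of pay has not gathered much favor with the southern factions.")

# Assumed gold standard: character spans of the three real determiners
gold = {m.span() for m in re.finditer(r"\b[Tt]he\b", text)}

def precision_recall(pattern):
    """Precision = TP / predicted, recall = TP / gold, compared by span."""
    predicted = {m.span() for m in re.finditer(pattern, text)}
    true_pos = len(predicted & gold)
    return true_pos / len(predicted), true_pos / len(gold)

p1, r1 = precision_recall(r"the")      # 2 of 5 hits correct, 2 of 3 found
p2, r2 = precision_recall(r"[Tt]he")   # 3 of 6 hits correct, 3 of 3 found
print(p1, r1)
print(p2, r2)
```

Going from /the/ to /[Tt]he/ raises recall to 1.0 while precision stays low, because the extra false positives (their, gathered, southern) are still matched.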

Advanced RE Operators(1)
• The RE /{3}/ means “exactly 3 occurrences” of the previous
character or expression.
• E.g., /a\.{24}z/ will match ‘a’ followed by 24 dots followed by
‘z’.

  RE      Match
  *       zero or more occurrences of the previous char or expression
  +       one or more occurrences of the previous char or expression
  ?       exactly zero or one occurrence of the previous char or expression
  {n}     n occurrences of the previous char or expression
  {n,m}   from n to m occurrences of the previous char or expression
  {n,}    at least n occurrences of the previous char or expression

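The counters in the table can be exercised as follows. The sketch uses Python's re module; the number string is an invented example, and \b is used only to keep each number whole.

```python
import re

# {24}: exactly 24 literal dots between a and z
print(bool(re.fullmatch(r"a\.{24}z", "a" + "." * 24 + "z")))   # True
print(bool(re.fullmatch(r"a\.{24}z", "a" + "." * 23 + "z")))   # False

numbers = "7 42 365 2024"
exactly_three = re.findall(r"\b[0-9]{3}\b", numbers)     # {n}
two_to_three = re.findall(r"\b[0-9]{2,3}\b", numbers)    # {n,m}
three_or_more = re.findall(r"\b[0-9]{3,}\b", numbers)    # {n,}
print(exactly_three)    # ['365']
print(two_to_three)     # ['42', '365']
print(three_or_more)    # ['365', '2024']
```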
Advanced RE Operators(2)
• Aliases for common sets of characters:
  RE   Expansion       Match                         Examples
  \d   [0-9]           any digit                     Party˽of˽5
  \D   [^0-9]          any non-digit                 Blue˽moon
  \w   [a-zA-Z0-9_]    any alphanumeric/underscore   Daiyu
  \W   [^\w]           a non-alphanumeric            !!!!!
  \s   [˽\r\t\n\f]     whitespace (space, tab)
  \S   [^\s]           Non-whitespace                in˽Concord

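A brief sketch of the aliases in Python's re module. One point to be aware of, which is a property of Python rather than of the slides: for text strings Python's \d, \w and \s are Unicode-aware by default, and they reduce to the ASCII expansions in the table only when the re.ASCII flag is given.

```python
import re

print(re.findall(r"\d", "Party of 5"))        # ['5']
print(re.findall(r"\D+", "Blue moon"))        # ['Blue moon']
print(re.findall(r"\w+", "Daiyu"))            # ['Daiyu']
print(re.findall(r"\W+", "!!!!!"))            # ['!!!!!']
print(re.split(r"\s+", "in Concord\tnow"))    # ['in', 'Concord', 'now']
print(re.findall(r"\S+", "in Concord"))       # ['in', 'Concord']

# \w matches accented letters unless re.ASCII restricts it to [a-zA-Z0-9_]
unicode_word = re.fullmatch(r"\w+", "naïve")
ascii_word = re.fullmatch(r"\w+", "naïve", re.ASCII)
print(bool(unicode_word), bool(ascii_word))   # True False
```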
Advanced RE Operators(3)
• Some characters need to be backslashed:
  RE   Match             Example patterns matched
  \*   an asterisk “*”   “K*A*P*L*A*N”
  \.   a period “.”      “Dr. Livingston, I presume”
  \?   a question mark   “why don’t they come to lend a hand?”
  \n   a newline
  \t   a tab

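The escaped characters can be tried as below. The sketch uses Python's re module; re.escape is a standard helper that adds the backslashes for you, shown here as a convenience that the slides do not mention.

```python
import re

stars = re.findall(r"\*", "K*A*P*L*A*N")
print(len(stars))                                                  # 5
print(re.search(r"\.", "Dr. Livingston, I presume").start())       # 2
print(bool(re.search(r"\?$", "why don't they come to lend a hand?")))  # True
print(re.split(r"\t", "col1\tcol2"))                               # ['col1', 'col2']

# re.escape backslashes every regex metacharacter in a literal string
escaped = re.escape("a.b*c?")
print(escaped)                                                     # a\.b\*c\?
print(bool(re.fullmatch(escaped, "a.b*c?")))                       # True
```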
Summary
• Regular expressions play a surprisingly large
role
– Sophisticated sequences of regular expressions are
often the first model for any text processing task
• For many hard tasks, we use machine learning
classifiers
– But regular expressions are used as features in the
classifiers
– Can be very useful in capturing generalizations

Finite State Automata (FSA)
• A finite state automaton (FSA) consists of a finite set of
states (nodes or vertices) connected by a finite number of
transitions (edges or links). Each transition is labeled
by a letter taken from some finite alphabet ∑.
• A computation starts at a designated state (the start or initial
state) and moves from one state to another along the
labeled transitions until it reaches a final (accepting)
state.
• FSAs also capture significant aspects of what linguists say
we need for morphology and parts of syntax.
• An RE can be viewed as a textual way of specifying the
structure of an FSA.

Regular Language and Regular Grammar
• A regular expression is one way of characterizing a particular
kind of formal language called a regular language. Both REs and
FSAs can be used to describe regular languages.
• A third equivalent method of characterizing the regular
languages is the regular grammar.

FSAs as Graphs
• Let’s start with the sheep language from
previous discussion.
baa!
baaa!
baaaa!
….
– /baa+!/

Sheep FSA
• We can say the following things about this
machine
– It has 5 states
– b, a, and ! are in its alphabet
– q0 is the start state
– q4 is an accept state
– It has 5 transitions

Definition
• A finite state automaton is a five-tuple (Q,q0,∑,δ,F),
where:
• ∑ is a finite set of alphabet symbols,
• Q is a finite set of states {q0, q1, q2, …, qN-1},
• q0 ∈ Q is the initial state,
• F ⊆ Q is the set of final states, and
• δ(q,i) is the transition function or transition matrix
between states. Given a state q ∈ Q and an input symbol
i ∈ ∑, δ(q,i) returns a new state q’ ∈ Q; δ is thus a relation
from Q×∑ to Q.

State Transition Table
• Example: For the sheep talk, Q = {q0, q1, q2, q3, q4},
∑ = {a,b,!}, F = {q4}, and δ(q,i) is defined in the transition table
below.

State transition table for sheep talk:

• Example: In an FSA diagram, a final state is depicted by two concentric
circles. The finite state automaton A = (Q,q0,∑,δ,F) with
∑ = {c,a,t,r}, F = {q3}, δ = {(q0,c,q1), (q1,a,q2), (q2,t,q3), (q2,r,q3)}
is depicted as follows:

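The second example can be written down as data, which may help make the five-tuple concrete. This is a sketch: the FSA container and the accepts() helper are our own names, and Q = {q0, q1, q2, q3} is inferred from the states mentioned in δ rather than stated on the slide.

```python
from typing import NamedTuple

class FSA(NamedTuple):
    states: frozenset     # Q
    start: str            # q0
    alphabet: frozenset   # Sigma
    delta: dict           # (state, symbol) -> next state
    finals: frozenset     # F

A = FSA(
    states=frozenset({"q0", "q1", "q2", "q3"}),   # inferred from delta
    start="q0",
    alphabet=frozenset("catr"),
    delta={("q0", "c"): "q1", ("q1", "a"): "q2",
           ("q2", "t"): "q3", ("q2", "r"): "q3"},
    finals=frozenset({"q3"}),
)

def accepts(machine, string):
    """Follow delta symbol by symbol; reject if a transition is missing."""
    state = machine.start
    for symbol in string:
        state = machine.delta.get((state, symbol))
        if state is None:
            return False
    return state in machine.finals

print([w for w in ["cat", "car", "ca", "cart"] if accepts(A, w)])
# ['cat', 'car']
```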
Recognition
• Recognition is the process of determining if a string should be
accepted by a machine
• Or… it’s the process of determining if a string is in the language
we’re defining with the machine
• Or… it’s the process of determining if a regular expression matches
a string
• Those all amount to the same thing in the end.
• A deterministic algorithm is one that has no choice points; the
algorithm always knows what to do for any input.
• One such algorithm is D-RECOGNIZE (the ‘D’ stands for
‘deterministic’).
• It takes as input a tape and an automaton.

Recognition
• Traditionally (following Turing’s notion), this process is depicted with a tape.

Figure: A tape with cells

• Simply a process of starting in the start state
• Examining the current input
• Consulting the table
• Going to a new state and updating the tape pointer.
• Until we run out of tape.

D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end

Figure: Tracing the execution of FSA on some sheep talk.

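A direct Python rendering of D-RECOGNIZE is sketched below. The transition table is our reconstruction of the sheep talk machine for /baa+!/ (5 states, 5 transitions, self-loop on state 3), since the table figure itself is not reproduced here; missing dictionary entries play the role of empty table cells.

```python
# Reconstructed table for /baa+!/: state -> {input symbol: next state}
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = frozenset({4})

def d_recognize(tape, table=SHEEP_TABLE, accept=ACCEPT_STATES, start=0):
    """Return True (accept) or False (reject) for the input string."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):                       # end of input reached
            return current_state in accept
        next_state = table[current_state].get(tape[index])
        if next_state is None:                       # empty table cell
            return False
        current_state = next_state
        index += 1

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!a"]:
    print(s, d_recognize(s))
# baa! True, baaaa! True, ba! False, baa False, baa!a False
```

To recognize a different language with this function, only the table and the set of accept states need to change.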
Sheep Talk FSA
• Adding a fail state (or sink state) to the sheep talk FSA

Key Points
• Deterministic means that at each point in
processing there is always one unique thing to
do (no choices).
• D-recognize is a simple table-driven interpreter
• The algorithm is universal for all unambiguous
regular languages.
– To change the machine, you simply change the
table.

Another Example of FSA
• An FSA for the words for English numbers 1-99

Example-2 … contd.
• FSA for simple dollars and cents

Generative Formalisms

• Formal languages are sets of strings, each string composed
of symbols from a finite symbol set called an alphabet
(such as ∑ = {a,b,!} for the sheep language).
• Given a model m, L(m) is the formal language
characterized by m. For sheep talk,
L(m) = {baa!, baaa!, baaaa!, baaaaa!, …}
• Finite-state automata define formal languages (without
having to enumerate all the strings in the language).
• The term generative is based on the view that we can run
the machine as a generator to get strings from the language.

Generative Formalisms
• FSAs can be viewed from two perspectives:
– Acceptors that can tell you if a string is in the
language
– Generators to produce all and only the strings in
the language
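The generator view can be sketched by walking the automaton's transitions and emitting a string every time an accept state is reached. The code below is an illustration only: the table is our reconstruction of the sheep talk machine for /baa+!/, and the max_len cut-off is needed because the language itself is infinite.

```python
from collections import deque

# Reconstructed sheep talk table: state -> {symbol: next state}
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}

def generate(table, accept, start=0, max_len=6):
    """Breadth-first enumeration of accepted strings up to max_len symbols."""
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in accept:
            yield prefix
        if len(prefix) < max_len:
            for symbol, next_state in table[state].items():
                queue.append((next_state, prefix + symbol))

sheep_strings = list(generate(SHEEP_TABLE, {4}))
print(sheep_strings)   # ['baa!', 'baaa!', 'baaaa!']
```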

Non-Deterministic FSAs

In the figure shown below, the self-loop is on state 2 instead of state 3.
Now, when we get to state 2 and see an ‘a’, we don’t know whether
to remain in state 2 or go on to state 3. Automata with decision
points like this are called non-deterministic FSAs (or NFSAs).

Non-Determinism cont.
• Yet another technique
– Epsilon transitions
– Key point: these transitions do not examine or advance
the tape during recognition

Solution to the problem of Non-determinism

• Backup: When we reach a choice point, place a
marker to mark where we were in the input and
what state the automaton was in. If the choice
turns out to be wrong, back up and try
another path.
• Look-ahead: We could look ahead in the input to
help us decide which path to take.
• Parallelism: Whenever we reach a choice point,
we could look at every alternative path in parallel.

Transition Table in NFSAs

• The guts of FSAs can ultimately be represented as tables

  State   b   a     !   ε
  0       1
  1           2
  2           2,3
  3                 4
  4

  If you’re in state 1 and you’re looking at an a, go to state 2

An Algorithm for NFSA recognition

Non-Deterministic Recognition: Search
• In an ND FSA there exists at least one path through the
machine for a string that is in the language defined by the
machine.
• But not all paths through the machine for an accepted
string lead to an accept state.
• No paths through the machine lead to an accept state for a
string not in the language.
• So success in non-deterministic recognition occurs when a
path is found through the machine that ends in an accept state.
• Failure occurs when all of the possible paths for a given
string lead to failure.

Recognition as Search

• You can view this algorithm as a trivial kind of
state-space search.
• States are pairings of tape positions and state
numbers.
• Operators are compiled into the table.
• Goal state is a pairing with the end of tape
position and a final accept state.
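The search view can be sketched with an agenda of (machine state, tape position) pairs. The table below follows the NFSA transition table shown earlier (self-loop on state 2); the function name nd_recognize and its parameters are ours, and epsilon transitions are deliberately left out to keep the sketch short. Popping from the end of the agenda gives depth-first search, popping from the front gives breadth-first search.

```python
# NFSA for sheep talk: state -> {symbol: list of possible next states}
NFSA_TABLE = {
    0: {"b": [1]},
    1: {"a": [2]},
    2: {"a": [2, 3]},    # choice point: stay in 2 or move on to 3
    3: {"!": [4]},
    4: {},
}

def nd_recognize(tape, table=NFSA_TABLE, accept=frozenset({4}),
                 start=0, depth_first=True):
    """Explore search states (machine state, tape index) until one succeeds."""
    agenda = [(start, 0)]
    while agenda:
        state, index = agenda.pop() if depth_first else agenda.pop(0)
        if index == len(tape):
            if state in accept:        # goal: end of tape in an accept state
                return True
            continue                   # dead end, try another search state
        for next_state in table[state].get(tape[index], []):
            agenda.append((next_state, index + 1))
    return False                       # agenda exhausted: every path failed

print(nd_recognize("baaa!"), nd_recognize("baaa!", depth_first=False))  # True True
print(nd_recognize("ba!"), nd_recognize("baaa"))                        # False False
```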

Example

b a a a ! \

q0 q1 q2 q2 q3 q4

Tracing Execution of NDFA for sheep talk Example: Depth First Search

Figure: Step-by-step depth-first trace (eight slides)

Tracing Execution of NDFA for sheep talk Example: Breadth First Search

Figure: Breadth-first trace

Key Points
• States in the search space are pairings of tape
positions and states in the machine.
• By keeping track of as yet unexplored states, a
recognizer can systematically explore all the
paths through the machine given an input.

Why Bother?
• Non-determinism doesn’t get us more formal
power and it causes headaches so why bother?
– More natural (understandable) solutions

Equivalence
• Non-deterministic machines can be converted
to deterministic ones with a fairly simple
construction
• That means that they have the same power;
non-deterministic machines are not more
powerful than deterministic ones in terms of
the languages they can accept
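One standard way to carry out this conversion is the subset construction, in which each deterministic state is a set of non-deterministic states. The slide does not name the construction, so the sketch below is an assumption on our part; it also ignores epsilon transitions, and the function names are ours.

```python
# NFSA for sheep talk: state -> {symbol: list of possible next states}
NFSA_TABLE = {
    0: {"b": [1]},
    1: {"a": [2]},
    2: {"a": [2, 3]},
    3: {"!": [4]},
    4: {},
}

def determinize(table, start, accept):
    """Subset construction for an NFSA without epsilon transitions."""
    start_set = frozenset({start})
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        symbols = {s for q in current for s in table[q]}
        for symbol in symbols:
            target = frozenset(n for q in current
                               for n in table[q].get(symbol, []))
            dfa[current][symbol] = target
            todo.append(target)
    # A set-state accepts if it contains at least one accepting NFSA state
    dfa_accept = {s for s in dfa if s & accept}
    return dfa, start_set, dfa_accept

def dfa_accepts(dfa, start_set, dfa_accept, tape):
    state = start_set
    for symbol in tape:
        state = dfa[state].get(symbol)
        if state is None:
            return False
    return state in dfa_accept

dfa, q0, final = determinize(NFSA_TABLE, 0, frozenset({4}))
print(sorted(sorted(s) for s in dfa))   # [[0], [1], [2], [2, 3], [4]]
print(dfa_accepts(dfa, q0, final, "baaa!"), dfa_accepts(dfa, q0, final, "ba!"))
# True False
```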

ND Recognition
• Two basic approaches (used in all major
implementations of regular expressions, see
Friedl 2006)
1. Either take a ND machine and convert it to a D
machine and then do recognition with that.
2. Or explicitly manage the process of recognition
as a state-space search (leaving the machine as
is).

Regular Languages
• The class of languages that are definable by
regular expressions is exactly the same as the class
of languages that are characterized by finite-state
automata (whether deterministic or non-deterministic).
• Therefore these languages are called regular
languages.

Regular Languages
• Languages that meet the following properties
are Regular Languages.

Regular Languages
• Regular languages are also closed under the
following operations:

Compositional Machines
• Formal languages are just sets of strings
• Therefore, we can talk about various set operations (intersection,
union, concatenation)
• This turns out to be a useful exercise

Figure: Automata for the base case (no operators) for the induction showing
that a regular expression can be turned into an equivalent automaton.

Union of Two FSAs

Figure: The union of two FSA’s

Concatenation of Two FSAs

Figure: The concatenation of two FSA’s.

Closure (Kleene *) of FSAs
• Iteration

Figure: The closure (Kleene *) of an FSA.

Negation
• Construct a machine M2 to accept all strings
not accepted by machine M1 and reject all the
strings accepted by M1
– Swap the accepting and non-accepting states in M1
• Does that work for non-deterministic
machines?

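For a deterministic machine the swap can be sketched as follows. The machine must first be made complete by routing every missing transition to a sink state, otherwise strings that fall off the table would be rejected by both M1 and M2. The table is our reconstruction of the sheep talk machine, and the names complement(), run() and SINK are ours. The swap is not sufficient for a non-deterministic machine, because a string is accepted there if any one path accepts; such a machine would be made deterministic first.

```python
# Reconstructed deterministic sheep talk table (M1)
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ALPHABET = "ba!"
SINK = "fail"

def complement(table, accept, alphabet):
    """Complete the DFA with a sink state, then swap accept and non-accept."""
    total = {q: {s: row.get(s, SINK) for s in alphabet}
             for q, row in table.items()}
    total[SINK] = {s: SINK for s in alphabet}
    return total, set(total) - set(accept)

def run(table, accept, tape, start=0):
    """Run a complete DFA; tape symbols must come from the alphabet."""
    state = start
    for symbol in tape:
        state = table[state][symbol]
    return state in accept

m2_table, m2_accept = complement(SHEEP_TABLE, {4}, ALPHABET)
print(run(m2_table, m2_accept, "baa!"))   # False: M1 accepts it
print(run(m2_table, m2_accept, "ba!"))    # True: M1 rejects it
print(run(m2_table, m2_accept, ""))       # True: M1 rejects the empty string
```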
End of Unit-2

???
