01 Regular Expressions 11-25

Uploaded by

idhitappu

Now we're gonna talk about text
processing. The most basic and fundamental
tool we have for text processing is the
regular expression. A regular expression
is a formal language for specifying text
strings. So let's suppose that we're
looking for woodchucks in a text document.
Woodchucks can be expressed in a number of
ways. We could have a singular woodchuck,
we could have the plural S at the end. We
could have a capital letter at the
beginning, or a lower case, and any
combination of these. So we're gonna need
tools to deal with this problem.

The simplest, most fundamental tool in a
regular expression is the disjunction. The
square brackets in a regular expression
pattern mean any letter inside these
square brackets. So square bracket
lowercase w, capital W, that is [wW],
means either a lowercase w or a capital W.
We can combine that with the rest of
woodchuck, [wW]oodchuck, to match
lowercase or uppercase woodchuck. And
similarly with digits: [1234567890]
matches any digit.

Now that was kinda annoying to write. So
what we'd like to do instead is have
little ranges. The range zero dash nine,
so square bracket zero dash nine, [0-9],
means any character inside that range. And
the range [A-Z] means any capital letter
between A and Z.

So let's see if we can see how that works.
Here's an example of a regex [inaudible],
a little tool we're going to use for
regular expression searching, and we have
here a little text from Dr. Seuss. We
looked, then we saw him step in on the
mat. We looked and we saw him, the cat in
the hat.
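
The bracket disjunctions and ranges described so far can be tried out in code. What follows is only a minimal sketch, assuming Python's standard `re` module and a made-up sample sentence (not the passage used in the lecture); the variable names are illustrative.

```python
import re

# Made-up sample text; an assumption for illustration only.
text = "Woodchuck or woodchuck? We saw 2 of them in 1999!"

# [wW] matches either a lowercase w or a capital W.
woodchucks = re.findall(r"[wW]oodchuck", text)

# [0-9] is a range: shorthand for [0123456789], any single digit.
digits = re.findall(r"[0-9]", text)

# [A-Z] matches any single capital letter.
capitals = re.findall(r"[A-Z]", text)

# Ranges can be combined in one bracket: any letter or digit.
alphanumerics = re.findall(r"[A-Za-z0-9]", text)

# Brackets can also list non-alphabetic characters: a space or "!".
spaces_and_bangs = re.findall(r"[ !]", text)

print(woodchucks)  # ['Woodchuck', 'woodchuck']
print(digits)      # ['2', '1', '9', '9', '9']
print(capitals)    # ['W', 'W']
```

Here `[A-Za-z0-9]` is one possible way to match all the alphanumeric characters by placing several ranges inside the same pair of brackets.
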
And let's try our disjunctions. We can
have the capital W and the lower case w,
and that's gonna match, as you can see,
the capital W's and the lower case w's
just fine. Or we could have all the E's
and all the M's, and that's gonna match
all the E's and the M's. Or in our
ranges, we can have all the capital
letters. Here's all the capital letters
being matched. We can have all the lower
case letters; there's a lot of lower case
letters there. Or we can match all of the
alphanumeric characters. Think for a
second how to match all of the
alphanumeric characters. We can have. Or
we can simply match some of the
non-alphanumeric characters. We can have
a space and an exclamation point in our
square brackets, and that is gonna match,
as you can see, some of the
non-alphabetic characters. Okay, so let's
go on. Another
kind of thing we might wanna do in our
regular expressions is negation in our
disjunction. We might wanna say we don't
want some set of letters. So for example,
we might wanna say: not a capital letter.
And we can do that by saying caret,
capital A through Z, in our square
brackets: [^A-Z]. Caret, when it occurs
right after the opening square bracket,
means not. So caret A through Z means not
a capital letter. Caret capital A, little
a, means neither a capital A nor a little
a. And caret, e, caret means not an e and
not a caret. So you can see that the
caret, when it occurs right after the
square bracket, means not, but later on
it means simply just a caret.
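
The negation rule above can be checked with a few short strings. This is only a sketch, assuming Python's `re` module; the test strings are made up for illustration.

```python
import re

# [^A-Z]: caret right after the opening bracket negates the set,
# so this matches anything that is NOT a capital letter.
not_capital = re.findall(r"[^A-Z]", "Hi Bob")

# [^Aa]: neither a capital A nor a little a.
neither_a = re.findall(r"[^Aa]", "Alabama")

# [^e^]: the first caret negates, the second is a literal caret,
# so this matches anything that is neither "e" nor "^".
not_e_or_caret = re.findall(r"[^e^]", "e^x")

# [e^]: a caret that is not first in the brackets is just a caret.
e_or_caret = re.findall(r"[e^]", "e^x")

print(not_capital)     # ['i', ' ', 'o', 'b']
print(neither_a)       # ['l', 'b', 'm']
print(not_e_or_caret)  # ['x']
print(e_or_caret)      # ['e', '^']
```

One caution specific to Python: outside square brackets an unescaped caret is treated as a start-of-line anchor, so a literal caret there has to be written as `\^`.
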
So let's take a look at that. We can try
finding all of the non-capital letters.
Here's all the non-capital letters. How
'bout all the non-exclamation points? Most
things. And the non-alphanumerics, sorry,
the non-alphabetics: there's just the
spaces and exclamation points, as you can
see. How 'bout looking for a caret? Any
carets in here? There are none. So there
are no carets [inaudible], nothing
matches.

Another type of disjunction, which can be
used for longer strings, is the pipe
symbol, sometimes called or, or pipe, or
just disjunction. So groundhog pipe
woodchuck will match either the string
groundhog or the string woodchuck. We can
sometimes use the pipe symbol for the same
thing as the square bracket: a pipe b
pipe c is the same as square bracket abc.
And we can combine these things. We can
combine the square brackets and the pipe,
so we can have groundhog or woodchuck but
use our square bracket for expressing
capitalization at the beginning. And we
can see that in our little example. We
can have looked or step, and sure enough,
the words looked and step are both
highlighted. Or we can have a disjunction
of just random things; they don't have to
be words. We can have all of the ats and
all of the [inaudible]. Any random string
is fine.

Finally, there's a set of special
characters that are very important in
regular expressions.
The question mark means that the previous
character is optional. So the question
mark after this u here, colou?r, will
match the word color with or without the
u. Then there are the two Kleene
operators, named for Stephen Kleene. The
Kleene star matches zero or more of the
previous character. So here is the star:
oo*h! has one o followed by zero or more
other o's. So there's the initial o and
zero other o's, and then our h!. Here's
our initial o followed by one o and then
the h!, and so on: two, three, and so on.
Sometimes, more simply, we can use the
Kleene plus, which means one or more of
the previous character. So there's our o
followed by the plus, meaning one or more
o's: one o, two o's, three o's, and so
on. And the dot is a special character
meaning any character, so beg.n can match
'begin', 'begun', 'beg3n'; the dot
matches anything.

And finally, two special characters. The
caret matches the beginning of the line,
so caret, capital A through Z, ^[A-Z],
matches a capital letter at the beginning
of the line. The dollar sign matches the
end of a line, so [A-Z]$ matches a capital
letter at the end of the line. And then
if we want to talk about a period, since
the period is a special character, we
have to escape it.
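
The pipe disjunction and the question mark, star, plus and dot described above can be sketched as follows. This assumes Python's `re` module; the test strings are made up, and the variable names are only illustrative.

```python
import re

# Pipe: match either whole string.
either = re.findall(r"groundhog|woodchuck", "a groundhog and a woodchuck")

# Pipe combined with brackets for capitalization at the beginning.
capitalized = re.findall(
    r"[gG]roundhog|[wW]oodchuck", "Groundhog, woodchuck, Woodchuck"
)

# ?: the previous character is optional.
optional_u = re.findall(r"colou?r", "color colour")

# *: zero or more of the previous character (one o, then any more o's).
sounds = "h! oh! ooh! oooh!"
star = re.findall(r"oo*h!", sounds)

# +: one or more of the previous character.
plus = re.findall(r"o+h!", sounds)

# .: any single character.
dot = re.findall(r"beg.n", "begin begun beg3n")

print(either)       # ['groundhog', 'woodchuck']
print(capitalized)  # ['Groundhog', 'woodchuck', 'Woodchuck']
print(optional_u)   # ['color', 'colour']
print(star)         # ['oh!', 'ooh!', 'oooh!']
print(plus)         # ['oh!', 'ooh!', 'oooh!']
print(dot)          # ['begin', 'begun', 'beg3n']
```

Note that `oo*h!` and `o+h!` accept exactly the same strings here; the plus is just the shorter way to say "at least one".
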
Backslash period means a period. So a
period by itself means any character;
backslash period means a real period.

Let's go look at some of these. So here's
the letter o. Let's make it one or more
o's first. Here's one or more o's: there's
one o over here, and two o's over here.
And now let's look at beginnings and ends
of lines. Here are capital letters at the
beginning of a line. Here are capital
letters at the end of a line. Oh, there
aren't any. Here is punctuation at the end
of a line. There's all the exclamation
points at the end of a line.
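
The beginning-of-line and end-of-line searches, and the escaped period, can be reproduced roughly like this. It is a sketch under two assumptions: Python's `re` module, where `re.MULTILINE` is needed for the caret and dollar sign to apply to every line rather than only the whole string, and a made-up three-line text.

```python
import re

# Made-up multi-line text; an assumption for illustration only.
text = "We looked!\nThen we saw him.\nThe cat sat."

# ^[A-Z]: a capital letter at the beginning of a line.
line_start_caps = re.findall(r"^[A-Z]", text, flags=re.MULTILINE)

# [A-Z]$: a capital letter at the end of a line (none in this text).
line_end_caps = re.findall(r"[A-Z]$", text, flags=re.MULTILINE)

# !$: an exclamation point at the end of a line.
line_end_bangs = re.findall(r"!$", text, flags=re.MULTILINE)

# \. is a real period; a bare . matches any character except a newline.
real_periods = re.findall(r"\.", text)
any_characters = re.findall(r".", text)

print(line_start_caps)  # ['W', 'T', 'T']
print(line_end_caps)    # []
print(line_end_bangs)   # ['!']
print(real_periods)     # ['.', '.']
```
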
Here's all the periods. Remember, we have
to backslash our periods. If we didn't
backslash the period, we would get all
the characters, 'cause period matches
everything.

All right, let's do one more example.
Let's look at this little sentence here,
"the other one there, the blithe one."
Let's walk through how to search for
words. Let's find all the instances of
the word "the" in this little passage. So
think for yourself how you would do this.
Well, the simplest thing you might do is
just type t-h-e. And that does a good job
of finding this "the" here, and this
"the" and that "the". But it misses these
two "The"s. It also finds some other
things. Let's fix the first problem.
How do we get not only the "the"s in the
middle but also those capitalized "The"s
at the beginning? Well, we're going to
use our disjunction. And sure enough, that
now correctly matches the two
beginning-of-line "The"s. But you notice
that our pattern, although it now captures
something it missed before, still
captures things it shouldn't be capturing:
"other", "there" and "blithe". So we need
to augment our pattern. How are we going
to augment our pattern? We really want
"the" only when there's no alphabetic
character around it. We need a space or
punctuation or something non-alphabetic.
Let's just say non-alphabetic afterwards.
Great. That gets rid of "other" and
"there". It doesn't solve "blithe",
because "blithe" has an alphabetic
character before the "the". So let's go
fix "blithe" by saying non-alphabetic
before and after. There we go. Now we've
found all of our "the"s.
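
The step-by-step refinement above can be replayed in code. This is a sketch only, assuming Python's `re` module and a one-line sample sentence modelled on the words mentioned in the lecture; the exact on-screen passage and patterns are not in the transcript. The last pattern (`step5`) is an addition not taken from the lecture: it allows the start or end of the string to count as a boundary.

```python
import re

# Assumed sample sentence, built from the words the lecture mentions.
text = "The other one there, the blithe one."

# Step 1: the plain string misses the capitalized "The".
step1 = re.findall(r"the", text)

# Step 2: bracket disjunction picks up "The" as well.
step2 = re.findall(r"[tT]he", text)

# Step 3: require a non-alphabetic character afterwards
# (drops "other" and "there", still hits "blithe").
step3 = re.findall(r"[tT]he[^A-Za-z]", text)

# Step 4: require a non-alphabetic character before and after.
# With a single string, the very first "The" has nothing before it.
step4 = re.findall(r"[^A-Za-z][tT]he[^A-Za-z]", text)

# Step 5 (not from the lecture): also accept start or end of string.
step5 = re.findall(r"(?:^|[^A-Za-z])[tT]he(?:[^A-Za-z]|$)", text)

print(len(step1), len(step2))  # 4 5
print(step3)                   # ['The ', 'the ', 'the ']
print(step4)                   # [' the ']
print(step5)                   # ['The ', ' the ']
```

In a tool that searches a multi-line passage, a "The" at the start of a later line is preceded by a newline, which counts as non-alphabetic, so the step 4 pattern can still find it there; only a "The" at the very start of the text needs the extra alternative shown in step 5.
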
We looked for "the". We noticed it missed
capitalized examples, so we added some. We
made our pattern more expansive; we
increased the yield of our pattern. But
that incorrectly returned more things. So
then we [inaudible], we needed to make
the pattern more precise by specifying
more things. This process that we went
through is based on fixing two kinds of
errors. One is matching strings we
shouldn't have matched: we matched
"there", we matched "other". So that's
solving the problem of false positives,
also called type one errors: we were
matching things we shouldn't match. And
the other thing we went through is
solving the problem of not matching
things we should have matched: we missed
those capital "The"s. That's dealing with
the problem of false negatives, or type
two errors. And it
turns out that in natural language
processing, we're constantly dealing with
these two classes of errors. So reducing
the error rate in any application, and
we're gonna see this again and again in
this course, involves two antagonistic
efforts. We're increasing the accuracy, or
precision, which helps us minimize those
false positives, or we're increasing our
coverage, technically called recall,
minimizing our false negatives.

So in summary, regular expressions play a
surprisingly large role in text
processing. Sophisticated sequences of
regular expressions, of which we've seen
very simple versions, are often the first
model for almost any text-processing task.
For harder tasks, we're often going to be
using machine learning classifiers, which
we'll introduce and which are much more
powerful. But it turns out that even
then, regular expressions are used as
features in the classifiers and are very
useful at capturing generalizations. So
you're going to be returning again and
again to regular expressions.
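
To make the precision and recall trade-off from the "the" example concrete, here is a small sketch that scores each successive pattern. Everything in it is an assumption for illustration: the sample sentence, the hand-labelled answer positions, the helper name `precision_recall`, and the final pattern with start and end alternatives, which is not from the lecture. It uses Python's `re` module.

```python
import re

# Assumed sample sentence and hand-labelled start offsets of the
# two real occurrences of the word "the" (positions 0 and 21).
text = "The other one there, the blithe one."
gold_starts = {0, 21}

def precision_recall(pattern, text, gold):
    """Score a pattern whose group 1 captures the candidate 'the'."""
    predicted = {m.start(1) for m in re.finditer(pattern, text)}
    true_pos = len(predicted & gold)
    # Precision: share of matches that are correct (few false positives).
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: share of correct answers found (few false negatives).
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

patterns = [
    r"(the)",
    r"([tT]he)",
    r"([tT]he)[^A-Za-z]",
    r"(?:^|[^A-Za-z])([tT]he)(?:[^A-Za-z]|$)",
]
scores = [precision_recall(p, text, gold_starts) for p in patterns]

for p, (prec, rec) in zip(patterns, scores):
    print(f"{p!r}: precision={prec:.2f} recall={rec:.2f}")
```

On this sample, adding the bracket disjunction raises recall, and each added non-alphabetic constraint raises precision, which mirrors the two kinds of fixes described in the walk-through.
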
