We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as TXT, PDF, TXT or read online on Scribd
You are on page 1/ 5
Now we're gonna talk about text
processing. The most basic and fundamental
tool we have for text processing is the regular expression. And regular expression is a formal language for specifying text strings. So let's suppose that we're looking for woodchucks in a text document, Woodchucks can be expressed in a number of ways. We could have a singular woodchuck, we could have the plural S at the end. We could have a capital letter at the beginning, or a lower case, and any combination of these. So we're gonna need tools to deal with this problem. So the simplest, fundamental tool in a regular expression is the disjunction. The square brackets in a regular expression pattern mean any letter inside these square brackets. So. Lowercase w, capital W. Square bracket means either a lowercase w or a capital W. So, we can combine that with woodchuck, to match lowercase or uppercase woodchuck. And, similarly with digits, one, two, three, four and so on. 56789. Zero matches any digit. Now that was kinda annoying to write. So, we'd like to do instead, is have little ranges. The range zero through dash nine so square bracket zero dash nine means any character inside that range. And the range A-Z, means any character, between A, a capital letter, between A and Z. So let's see if we can see how that works. So here's an example of a Red X [inaudible], a little tool we're going to use for regular expressions searching, and we have here a little text from Dr. Seuss. We looked, then we saw him stepping on the mat. We looked and we saw him, the cat in the hat. And let's try our, our, disjunctions, so we can have, the capital W and the lower case w. And X, excuse me, a capital W and a lower case w. And that's gonna match, as you can see, the capital W's and the lower case w's just fine. Or we could have all the E's and all the M's. And that's gonna match all the E's and the M's. Or in our ranges, we can have all the capital letters. Here's all the capital letters being matched. We can have all the lower case letters, there's a lot of lower case letters there or we can match all of the alphanumeric characters, think for a second how to match all of the alphanumeric characters. We can have. Or we can simply match some of the non-alphanumeric characters. We can have space, an exclamation point in our square brackets and that is gonna match, as you can see, some of the non-alphabetic characters. Okay, so let's go on. Another kind of thing we might wanna do in our regular expressions is negation in our dis-junction. We might wanna say we don't want some kind of, set of letters. So for example, we might wanna say, not. A capital letter. And we can do that by saying, carrot, a through z, in our square brackets. Carrot, when it occurs right after the square brackets, means not. Carrot a through z, not a capital letter. Caret a, little a. Means neither a capital a or a little a. And carrot, E, carrot, means not an E, and not a carrot. So you can see that the carrot, when it occurs right after the square bracket, means not. But later on means simply just a carrot. So let's take a look at that. [sound]. So we can try, finding all of the non-capital letters. Here's all the non capital letters. How ?bout all the non exclamation points? [sound]. Most things, and the non-alpha numerics. [sound]. Sorry, it didn't, non-alphabetics. And there's just the spaces and exclamation points, as you can see. How bout looking for a carrot? Any carrots in here? There are none. So there are no carrots [inaudible], nothing matches. Another type of disjunction which can be used for longer strings is the pipe symbol, sometimes called or, or pipe, or just disjunction. So groundhog or woodchuck can be, can, will mean either the string groundhog or the string wood, woodchuck. So we can use the pipe symbol sometimes for the same thing as the square bracket, so A pipe B, pipe C. It's the same as square bracket ABC, and we can combine these things. We can combine the square brackets in the pipe so we can have groundhogger woodchuck but use our square bracket for expressing capitalization at the beginning. And we can see that in our, in our little example. We can have looked or step. And sure enough, there, the words looked and step are both highlighted. Or we can have distinction of just random things, they don't have to be words. We can have all of the ats. Excuse me, all of the ats, and all of the [inaudible]. And, any random string is fine. Finally, there's sets of special characters that are very important in regular expressions. The question mark means that the previous character is optional. So the question mark after this U here, I mean, will match the word color, with or without the u. With, without the U, with the U. Then there are the two cleaning operators named for Steven cleaning. [inaudible] star matches zero or more of the previous characters. So here is the star. It matches zero or more Os. So we have one O followed by zero or other Os. So there's the initial O and zero other Os. And then our H!. Here's our initial O followed by one O and then the H, and so on. Two, three, and so on. Sometimes, more simple, we can have the, the clean plus, so, that means one or more of the previous characters. So, there's our O followed by the plus, meaning one or more O. So there's one O, there's two O's, three O's, and so on. And the dot, is a special character meaning any character, so BEG.N can match ?begin', 'begun', 'BEG3N', it matches anything. [sound] And finally two special characters... The caret matches the beginning of the line. So caret, capital A through Z matches a capital letter at the beginning of the line. The dollar sign matches the end of a line. So A through Z dollar matches the end of a line, like the capital letter at the end of the line. And then if we want to talk about a period, since periods are a special character, we have to escape them. Back slash period means a period. So a period by itself means any character. Back slash period means a real period. Let's go look at some of these. So here's the letter O. Here's zero or, it's, like, make, let's make it one or more O first. Here's one or more O. So there's 1-O over here, and two O's over here. [sound] And, and now lets look at, at beginnings and ends of lines. Here is capital letters at the beginning of a line. Here's. Capital letters at the end of a line. Oh, there aren't any. Here is punctuation at the end of a line. There's all the exclamation points at the end of a line. [sound]. Here's all the periods. Remember we have to backslash our periods. And, if we didn't backslash the period, we would get all the characters, 'cause period matches everything. All right, let's do one more example. Let's look at, this little sentence here, the other one there, the Blithe one. Let's, let's walk through how to search for words. Let find the word, all the word the in this little passage. So think for yourself how you would do this. Well, the simplest thing you might do is just type the t-h-e. And, that does a good job of finding this, this the here. Let's find this the and that the. But it misses these two thes. It also finds some other things. Let's fix the first problem. How do we not only get the thes in the middle but those capitalized thes at the beginning, well we're going to use our Are disjunction. And sure enough, that correctly now matches the two, beginning of line. The, thes. But you notice that our pattern, although it now captures something it missed before, it still captures things it shouldn't be capturing, other. There and blithe. So, we need to augment our patterns. So, how are we going to augment our pattern. We really want the when it's, there's not in alphabetic character around. We need a space or punctuation or something non-alphabetic. Let's just say non-alphabetic afterwards. Great. That get's rid of other and there. Doesn't solve blithe because blithe has an alphabetic character before it. So, let's go fix blithe. By saying non alphabetic before ad ether. There we go. Now we've found all of our, all of our the's. So. We, we looked for the. We noticed it missed capitalized examples, so we, we added some. We made our pattern more, more, ris- more, expansive. We increased the yield of our pattern. But that incorrectly returns more things. So then we [inaudible], we need to make the pattern more precise. By, by specifying more thing this process. And, that we went through, is based on fixing two kinds of errors. One is matching strings we shouldn't have matched. We matched there, we matched other. So that's trying to, that's solving the problem of false positives, or they're called type one errors. We were maxing things we shouldn't match. And the other thing we went through is, to solve the problem of not matching things we should have matched. So we missed those capital thes. And that's dealing with the problem of false negatives, or type two errors. And it turns out in nat-, in natural language processing, we're constantly dealing with these two classes of errors. So reducing the error rate in any application, and we're gonna see this again and again, in this course, involves two antagonistic efforts. We're increasing the accuracy or precision, which helps us minimize those false positives. Or we're increasing our coverage, or technically called recall, minimizing our false negatives. So in summary, regular expressions play a surprisingly large role in text processing. And the sophisticated sequences of regular expressions that we've seen very simple versions of are often the first model for almost any text-processing task. For harder tasks, we're often going to be using, and we'll introduce these, these machine learning classifiers that are much more powerful. But it turns out even then, regular expressions are used as features in the classifiers and are very useful at capturing generalizations. So you're going to be returning again and again to regular expressions.