Grepping for Exact Strings

SkySmart · April 12, 2009, 6:40pm

ok, apparently this is a very difficult question to answer based on my searches on google that came up fruitless.

what i want to do is grep through a file for words that match a specified string.

but the thing is, i keep getting all words in the file that have the string in them.

say for instance all I want from a file is the word "airplane". I want to replace this word "airplane" with "helicopter".

the problem is, in the file i'm using, there are several occurrences of the word "airplane". example: "airplanewomen, airplaneman, sweetairplane, sourairplane".

When i go to look for the word airplane, all these other strings come up, and I don't want them.

does anyone know of a way to grep out just the occurrences of only the word "airplane" from a file?

by the way, grep -w doesn't work, and neither does grep -x or grep "^string".

i'm using ubuntu 8.

krabu · April 12, 2009, 6:54pm

try grep " airplane " or grep " airplane." for end of line

Gunther · April 12, 2009, 7:26pm

grep "^airplane$"

SkySmart · April 12, 2009, 7:43pm

i've actually tried this but nothing shows up.

do you know how sed can be used for this?

SkySmart · April 12, 2009, 7:43pm

this seems to work. but how do i incorporate it into sed to replace just the occurrences of airplane?

rubionis · April 12, 2009, 7:56pm

try this ...

sed 's:\<airplane\>:helicopter:g'  filename

SkySmart · April 12, 2009, 8:13pm

your code seems like it might do the trick. but let me give a bit more explanation of the problem im having.

sometimes in the file i'm using, there are words like this: db-airplane, db-12.airplane.

in cases like that, your code turns the words into db-helicopter, db-12.helicopter.

you see what i'm saying? I don't want the code to touch anything in the file that isn't "airplane", alone. all i want is to replace places where airplane stands alone.

thank you so much for your suggestions

rubionis · April 12, 2009, 9:27pm

I see your situation now..., not the perfect solution, but this might get you going:

sed 's: airplane : helicopter :g' filename 

# does not handle the case when line starts or ends with "airplane"

I'd switch to awk, perl,... for a more granular pattern check.

giannicello · April 12, 2009, 9:31pm

sed 's/[ ]airplane[ ]/ helicopter /g' myfile

ghostdog74 · April 12, 2009, 10:06pm

go through each word and test it against just the word "airplane"

awk '
{
  for(i=1;i<=NF;i++){
     if($i == "airplane"){
           print "Found airplane: " $0
     }
  }
}
' file

krabu · April 13, 2009, 6:46am

try this
sed 's/ airplane / helicopter /g'
and
sed 's/ airplane\./ helicopter./g'

quirkasaurus · April 13, 2009, 12:15pm

the problem is . . . that your definition of word delimiter is different from sed's.
sed's includes any punctuation or any whitespace.
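
To see the difference concretely, here is a small sketch (the helper name show_word_matches and the sample lines are made up for illustration). For grep -w, as for \<...\> in GNU sed, a word is a run of letters, digits and underscores, so "-" and "." count as delimiters and db-airplane still matches:

```shell
# Hypothetical helper: print the input lines that grep -w considers
# to contain the whole word "airplane".
show_word_matches() {
  grep -w 'airplane'
}

# Sample input, one candidate per line.
printf '%s\n' 'db-airplane' 'db-12.airplane' 'airplane' 'airplaneman' |
  show_word_matches
# The first three lines are printed; only "airplaneman" is filtered out.
```

So grep -w is working as designed here; it just draws the word boundary in a different place than the OP wants.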

so.... we might do well with more information.

Are all these words in a comma-separated list, or similar?

SkySmart · April 13, 2009, 2:10pm

Some words are in a comma-separated list, while others aren't. That is what I guess makes this a very complicated task to complete. Never imagined it'd be this hard to replace words in a file.

quirkasaurus · April 13, 2009, 2:57pm

well, anyway, a fun problem. Skysmart, try uploading a copy of the file and
let us all have a look.

One solution is to do something like this:

find all words that match any or all of your string.
( that is, the db-airplane, db-airplane.something, airplane ) etc...

Then, for any word that is NOT exactly airplane, go back through
the file and modify it first to something else.... for instance....
the capitalized version or tack-on airplaneDONOTEDIT to it.

Then, edit '\<airplane\>' as usual.

But to provide you the code for that, it would be helpful to have your original file.

I give the total estimate as 1 SMOP.

quirkasaurus · April 13, 2009, 3:15pm

Here you go:

My input file "a" contains:

my-do-not-change-airplane and other frogs
airplane.seriously change me
sometimes in the file i'm using,
there are words like this: db-airplane, db-12.airplane.
in cases like that, your code turns
the words into db-helicopter, db-12.helicopter.
airplane,frogs,somewerirdairplane buggly buggly
aardvark,chameleon,airplane,dugong,basilisk
aardvark,chameleon,dugong,basilisk,airplane

My script, "clam" looks like this:

#!/bin/ksh
#----------------------------------------------------------------------#
# Find funky occurrences of airplane...                                #
#----------------------------------------------------------------------#
cat a |
#----------------------------------------------------------------------#
# Translate any and all word delimiters to newlines...                 #
#----------------------------------------------------------------------#
  tr '[. ,      ]' '\012\012\012\012' |
  grep airplane |
#----------------------------------------------------------------------#
# Grep OUT our target word to change...                                #
#----------------------------------------------------------------------#
  grep -v "^airplane$" |
  sort -u |
while read pattern ; do
#----------------------------------------------------------------------#
# Create list of sed commands...                                       #
#----------------------------------------------------------------------#
  print "s/$pattern/${pattern}DONOTEDIT/g;"
done > sed.file
#----------------------------------------------------------------------#
# Finish our little sed script. Remove our DONOTEDIT strings.          #
#----------------------------------------------------------------------#
cat << EOF >> sed.file
s/\<airplane\>/HELICOPTER/g;
s/DONOTEDIT//g;
EOF
#----------------------------------------------------------------------#
# ... and voila....                                                    #
#----------------------------------------------------------------------#
sed -f sed.file a

... and the output is:

my-do-not-change-airplane and other frogs
HELICOPTER.seriously change me
sometimes in the file i'm using,
there are words like this: db-airplane, db-12.HELICOPTER.
in cases like that, your code turns
the words into db-helicopter, db-12.helicopter.
HELICOPTER,frogs,somewerirdairplane buggly buggly
aardvark,chameleon,HELICOPTER,dugong,basilisk
aardvark,chameleon,dugong,basilisk,HELICOPTER

quirkasaurus · April 13, 2009, 3:28pm

I see that the script also changed db-12.airplane to db-12.HELICOPTER.

You can change this behavior in one of two ways:

remove the . from the "tr" translation list (and remove one of the \012 sequences too;
btw, the word-delimiter count needs to match the \012 count).

Or simply intercept the sed.file and modify its contents first.

quirkasaurus · April 13, 2009, 3:37pm

anyways.... removed the . word delimiter. here's the code:

#!/bin/ksh

#----------------------------------------------------------------------#
# Find funky occurrences of airplane...                                #
#----------------------------------------------------------------------#
cat a |

#----------------------------------------------------------------------#
# Translate any and all word delimiters to newlines...                 #
#----------------------------------------------------------------------#
  tr '[ ,       ]' '\012\012\012' |
  grep airplane |

#----------------------------------------------------------------------#
# Grep OUT our target word to change...                                #
#----------------------------------------------------------------------#
  grep -v "^airplane$" |
  sort -u |
while read pattern ; do

#----------------------------------------------------------------------#
# Create list of sed commands...                                       #
#----------------------------------------------------------------------#
  print "s/$pattern/${pattern}DONOTEDIT/g;"

done > sed.file

#----------------------------------------------------------------------#
# Finish our little sed script. Remove our DONOTEDIT strings.          #
#----------------------------------------------------------------------#
cat << EOF >> sed.file
s/\<airplane\>/HELICOPTER/g;
s/DONOTEDIT//g;
EOF

#----------------------------------------------------------------------#
# ... and voila....                                                    #
#----------------------------------------------------------------------#
sed -f sed.file a

Gunther · April 19, 2009, 12:32am

I spent some time trying to get into awk. So, here is my solution:

awk 'BEGIN {FS="[.,\ ]"; OFS=" ";} {for(i=1; i<=NF; ++i) {if($i=="airplane") {sub(/airplane/, "helicopter", $i);}} print $0;}' airplane > airplane.new

There's a small flaw in it, though, due to the fact that the input field delimiter (FS) is a regular expression instead of a static string/character. Unfortunately, I didn't find a way to "preserve" the actual input delimiter for output (OFS) but had to set it to a static (whitespace) character.

Maybe one of you guys knows a way how to achieve this.
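
One possible way around the OFS problem, as a rough sketch only: skip field splitting and walk each line by hand, copying delimiter runs through unchanged. This assumes the same delimiter set as the FS above (dot, comma, space), so db-12.airplane would still be changed; the function name swap_standalone is made up for illustration.

```shell
# Hypothetical helper: replace tokens equal to "airplane" while keeping
# the original delimiters (dot, comma, space) exactly as they were.
swap_standalone() {
  awk '
  {
    out = ""
    rest = $0
    # Peel off one token or one run of delimiters at a time.
    while (length(rest) > 0) {
      if (match(rest, /^[^., ]+/)) {
        # A token: replace it only when it is exactly "airplane".
        tok = substr(rest, 1, RLENGTH)
        if (tok == "airplane") tok = "helicopter"
        out = out tok
      } else {
        # A run of delimiters: copy it through untouched.
        match(rest, /^[., ]+/)
        out = out substr(rest, 1, RLENGTH)
      }
      rest = substr(rest, RLENGTH + 1)
    }
    print out
  }' "$@"
}

printf '%s\n' 'airplane,frogs db-airplane airplane.' | swap_standalone
```

It only uses match(), substr() and RLENGTH, so it should not need gawk extensions. To keep db-12.airplane intact, the dot would have to come out of both bracket expressions.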

--Gunther

colemar · April 19, 2009, 8:36am

Skysmart, you have to be more precise about what your definition of "airplane alone" is.
Given this definition, it should be feasible to do the replacement with sed alone, no pun intended.

Usually the definition for "word" is: a sequence of alphanumeric characters surrounded by punctuation characters or by the start/end of the line.
Usually the definition for alphanumeric is: 0,1,...,9,a,b,...,z,A,B,...,Z,_
Usually the definition for punctuation is: any non-alphanumeric character.

I understand that you assume that the character "-" is not punctuation.
Is there any other character that you deem not punctuation?

sed -r 's/(^|[^0-9a-zA-Z_-])airplane([^0-9a-zA-Z_-]|$)/\1helicopter\2/g'

Tested with GNU sed version 4.1.5.

-r is to make sed understand extended regular expression syntax.
^ stands for the start of the line and $ stands for the end of the line; they are "anchors" and do not represent a character.
| is the alternation operator, that is, it separates multiple choices.
[] is a class of characters; this expression represents a single character; [^...] is the class of characters that are not in the class [...].
\1 and \2 are backreferences and they are required to keep the characters surrounding the airplane.
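
One caveat with this pattern: each match consumes the character after airplane, so when two standalone occurrences are separated by a single character (e.g. "airplane airplane"), one pass with g only replaces the first. A possible workaround, sketched here under the assumption that your sed accepts -E (GNU sed also takes -r), is to repeat the substitution in a loop until nothing matches. The function name replace_standalone is made up for illustration.

```shell
# Hypothetical helper: same pattern as above, but applied one match at a
# time in a loop, so back-to-back occurrences are all replaced.
replace_standalone() {
  sed -E \
    -e ':again' \
    -e 's/(^|[^0-9a-zA-Z_-])airplane([^0-9a-zA-Z_-]|$)/\1helicopter\2/' \
    -e 't again' \
    "$@"
}

echo 'airplane airplane db-airplane' | replace_standalone
```

The loop ends because every pass removes one standalone "airplane" and the inserted "helicopter" can never match the pattern again.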

rubin · April 19, 2009, 9:53am

If I correctly understand the OP's request, here goes another way with perl,

In my test file, many words contain the required pattern, but only those surrounded by whitespace, or found at the start or end of the record, are to be replaced:

$ cat file
data data data data airplane. data data data data data data airplane data data

data airplane. data airplanewomen, airplaneman, sweetairplane, sourairplane.

airplane data data data data data data data data data db-airplane, db-12.airplane.

data data airplane, data data data data data data wantairplane data airplane

data data data data airplane data data data data data .airplane data data data

perl -ne 's/(^|(?<=\s))airplane((?=\s)|$)/helicopter/g; print' file

Output:

data data data data airplane. data data data data data data helicopter data data

data airplane. data airplanewomen, airplaneman, sweetairplane, sourairplane.

helicopter data data data data data data data data data db-airplane, db-12.airplane.

data data airplane, data data data data data data wantairplane data helicopter

data data data data helicopter data data data data data .airplane data data data

Tested with perl v5.10.0 (Linux) and v5.8.4 (Solaris 10).
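
The same whitespace-only definition can also be used for the original grep question. A minimal sketch, assuming a grep that supports -E and POSIX character classes; the function name grep_standalone is made up for illustration:

```shell
# Hypothetical helper: print only lines where "airplane" is bounded by
# whitespace or by the start/end of the line.
grep_standalone() {
  grep -E '(^|[[:space:]])airplane([[:space:]]|$)' "$@"
}

printf '%s\n' \
  'data airplane. data' \
  'airplane data db-airplane' \
  'x wantairplane y' \
  'end airplane' |
  grep_standalone
# Prints the second and fourth lines only.
```

Unlike the lookarounds in the perl version, the grep pattern includes the surrounding whitespace in the match, which does not matter when you only want to know which lines qualify.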