Amazon Food Review Notes

The document describes a project to classify Amazon food reviews into positive and negative categories. It discusses preprocessing the review data by removing duplicates, converting ratings to positive/negative labels, and converting text to word vectors. Word vectors are created using techniques like bag-of-words, term frequency-inverse document frequency (TF-IDF), and word2vec to represent text as numerical features for machine learning models that classify review sentiment.


AMAZON FOOD REVIEWS:

The Amazon food reviews project classifies customer reviews as positive or negative.
The aim of the project is to build a system that treats ratings of 4 and 5 as positive (the best), a rating of 3 as moderate (neutral), and ratings of 1 and 2 as negative (the worst).

The data was collected by SNAP (the Stanford Network Analysis Project).
The attributes recorded for each review are:

1. Id
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator
6. HelpfulnessDenominator
7. Score
8. Time
9. Summary
10. Text
Converting the data into a machine learning problem:

We use the rating (Score) as the source of the class label: ratings of 4 and 5 are positive, a rating of 3 is neutral (and is dropped), and ratings of 1 and 2 are negative.
The remaining 8 attributes are available as metadata about the user and the rating, but the review Text is the primary information used to predict the sentiment.
Libraries imported: pandas (with the alias name pd), sqlite3, and the rest of the scientific Python stack.

First we create a connection to the SQLite database using sqlite3.connect.
Then we use pandas' read_sql_query with a SQL SELECT query to fetch data directly from the database through the connection object 'con', keeping every review except those with a score of 3 (neutral).
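A minimal sketch of this step, assuming the Kaggle release of the dataset ('database.sqlite' containing a 'Reviews' table; both names are assumptions, adjust them to your copy):

import sqlite3
import pandas as pd

# Connect to the SQLite database (file name assumed).
con = sqlite3.connect('database.sqlite')

# Fetch all reviews except the neutral ones (Score = 3).
filtered_data = pd.read_sql_query("""
    SELECT * FROM Reviews
    WHERE Score != 3
""", con)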

Then we define a function to classify a rating as positive or negative: if the rating is less than 3 we classify it as negative, and if it is greater than 3 we classify it as positive.
First we fetch the Score column from the filtered data table and store it in a series of actual scores.
Next we apply Python's 'map' function to that series to convert each score into 'positive' or 'negative'.
Then we replace the Score column of the filtered table with these labels, using three lines of code as shown in the sketch below, and print the data to confirm that the numeric scores have been replaced by 'positive' and 'negative'.
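A minimal sketch of the labelling step (column names follow the dataset schema listed above):

def partition(x):
    # Ratings above 3 are positive, below 3 negative
    # (neutral 3s were already excluded by the SQL query).
    return 'positive' if x > 3 else 'negative'

actual_score = filtered_data['Score']
positive_negative = actual_score.map(partition)
filtered_data['Score'] = positive_negative

print(filtered_data.head())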
Time is stored in the form of a Unix timestamp.
Now we clean the data. Removal of duplicate data:

To inspect the duplicates, we run a SQL query on the Reviews table that selects reviews sorted by ProductId.
We then notice that the same user appears to enter two reviews at the same timestamp, which is impossible. From this we learn that different varieties of the same product all receive identical copies of the same review.
So we need to deduplicate the data.
First we sort the data by the ProductId column using quicksort and store the result in a dataframe called sorted_data.
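A sketch of the sorting step:

# Sort by ProductId so duplicate reviews of product variants sit together.
sorted_data = filtered_data.sort_values(
    'ProductId', axis=0, ascending=True, kind='quicksort')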

Now we drop the duplicates: we treat rows with the same UserId, ProfileName, Time, and Text as duplicates, keep the first occurrence, and remove the rest using pandas' drop_duplicates function, as shown below.
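A sketch of the pandas deduplication call:

# Rows identical in these four columns are copies of the same review.
final = sorted_data.drop_duplicates(
    subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')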

Another anomaly in the data is rows where HelpfulnessNumerator > HelpfulnessDenominator. This should be impossible, since HelpfulnessNumerator is the number of people who said the review was helpful, and HelpfulnessDenominator is the number who said yes plus the number who said no. So we remove those rows:
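A sketch of the filter:

# Keep only rows where the helpfulness counts are consistent.
final = final[final['HelpfulnessNumerator'] <= final['HelpfulnessDenominator']]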
Now we convert text to vectors:

We place all the positive and negative reviews in the same geometric space and try to find a separating hyperplane such that points on one side of the plane are positive reviews and points on the other side are negative reviews.
So we first convert each review text into a d-dimensional vector and then find a plane that separates them.
But first we need to work out how to convert a text into a d-dimensional vector.
Let r1, r2, r3 be three reviews. If similarity(r1, r2) > similarity(r1, r3), then the distance between the vectors for r1 and r2 must be smaller than the distance between the vectors for r1 and r3.
Strategies for converting text to vectors:

1. Bag of words:
In this technique we build a dictionary (a set) of all the unique words occurring in the text.
We call this set the corpus, and each review a document.
2. Constructing the vector:

Bag of words creates a d-dimensional vector for each review: one slot per unique word in the corpus, where the number stored in a slot is the count of how many times that word occurs in the review.
Because any single review uses only a tiny fraction of the corpus, most slots are zero; such a vector is called a sparse vector. Each word represents a different dimension.
The more similar the word frequencies of two reviews, the closer their vectors are to each other.
In the worked example, the distance between the two review vectors comes out as the square root of 3, because the reviews differ in only three word counts. Even though the distance could have been much larger, we get a small value, which says the vectors are very close; yet the English meanings of the two reviews are completely opposite.
A variant that stores only 0 or 1 in each slot (did the word occur at all?) is known as the binary bag of words.
The Euclidean distance between two binary bag-of-words vectors is roughly the square root of the number of differing words.
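A small illustration of the binary bag of words and the distance claim, using two made-up reviews (the sentences are illustrative, not taken from the dataset):

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

docs = ['this pasta is very tasty and affordable',
        'this pasta is not very tasty and is affordable']

# binary=True stores 1 if a word occurs at all, regardless of count.
vect = CountVectorizer(binary=True)
X = vect.fit_transform(docs).toarray()

# The reviews differ in a single word ('not'), so the Euclidean
# distance is sqrt(1) = 1, even though the meanings are opposite.
print(np.linalg.norm(X[0] - X[1]))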
Stop-word removal, tokenization, stemming, lemmatization:

1. Stop words:
Stop words are words that carry little meaning on their own. In some cases, however, they can be highly meaningful (removing 'not' flips the sentiment), so they must be removed with care. Once stop words are removed the vector becomes smaller, more efficient, and more meaningful.
We use standard NLP libraries for this step.

2. Case conversion: all text is converted to lower case so that differently cased forms of a word map to the same dimension.
3. Stemming:
Stemming maps related words like 'tasty', 'tasteful', and 'taste' to a single root form.
The Porter stemmer and the Snowball stemmer are the usual algorithms here, and the Snowball stemmer is the more effective of the two.

4. Tokenization and lemmatization: tokenization breaks a sentence up into words, while lemmatization reduces each word to its dictionary root form (its lemma). A sketch of these preprocessing steps follows.
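A minimal sketch of stop-word removal, lower-casing, and stemming with NLTK (the example sentence is made up):

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def clean_text(text):
    # Lower-case, drop stop words, and stem what remains.
    words = text.lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in stop_words)

print(clean_text('These snacks were really tasty and tasteful'))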

Demerits of bag of words:

1. It does not use the semantic meaning of words (there is no guarantee that words with similar meanings map to nearby dimensions).

Unigrams, bigrams, n-grams:
Unigrams: in the unigram model, each single word is a unique dimension.
Bigrams: in the bigram model, each pair of consecutive words is a unique dimension of the d-dimensional vector.
Trigrams: each run of three consecutive words is a unique dimension of the d-dimensional vector.

We use n-grams because the plain bag-of-words technique deletes the sequence information; n-grams are able to preserve some of that local word order.
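A sketch of an n-gram bag of words with scikit-learn (ngram_range=(1, 2) keeps both unigrams and bigrams; 'final' is the cleaned dataframe from the deduplication step):

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(1, 2))
final_bigram_counts = count_vect.fit_transform(final['Text'].values)

print(final_bigram_counts.get_shape())  # (n_reviews, n_unigrams + n_bigrams)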
Term frequency and inverse document frequency (TF-IDF):

Term frequency is defined as
TF(Wi, Rj) = (number of times Wi occurs in Rj) / (total number of words in Rj),
i.e. how often the word Wi occurs in the document Rj.

Inverse document frequency is defined as
IDF(Wi, Dc) = log(N / ni),
where Wi is the word, Dc is the document corpus, N is the total number of documents in Dc, and ni is the number of documents containing Wi.
Two properties follow:
1. IDF >= 0
2. ni <= N

Applying TF and IDF together, each cell of the vector stores TF(Wi, Rj) * IDF(Wi, Dc):
more importance is given to words that are rare in the corpus Dc, and more importance is given to a word that is frequent within a given review.
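A sketch with scikit-learn's TfidfVectorizer (note that scikit-learn's exact TF-IDF formula differs slightly from the definition above, e.g. it smooths the IDF by default):

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer()
final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)

# Each row is a review; each column is a word weighted by TF * IDF.
print(final_tf_idf.get_shape())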

Why do we use log in IDF?

The answer comes from Zipf's law: word frequencies follow a power-law distribution, so a handful of words occur enormously more often than the rest. Without the log, the raw ratio N/ni for rare words would be huge and would completely dominate the TF term; taking the log compresses this range so that TF and IDF stay on comparable scales.
WORD2VEC: Deep learning concept

* Word2vec is a deep learning model (available, for example, in TensorFlow) that represents each word as a dense d-dimensional vector rather than a sparse one (most of the dimension values are not zero).
* Consider the three words 'tasty', 'delicious', and 'baseball'. Since 'tasty' and 'delicious' mean nearly the same thing, we expect the distance between their vectors to be minimal, and the distance to 'baseball' to be the largest. That is exactly what happens with word2vec embeddings.
* In other words, vectors that are close together are semantically similar in English.
If we take the vector for 'man' as one point and 'woman' as another point in 300-dimensional space, the vector joining them is parallel to the vector joining 'king' and 'queen'. This means word2vec somehow captures the gender relationship as a direction in high-dimensional space.
The same holds for other relations, such as verb tenses and country-capital pairs.
The widely used pretrained word2vec model was trained on the Google News corpus and represents each word as a 300-dimensional vector with high accuracy.
This word2vec model can likewise be applied to our review system.
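A minimal sketch of training word2vec on the review text with gensim (parameter names follow gensim 4.x; 'final' is the cleaned dataframe, and training our own model is an alternative to loading the pretrained Google News vectors):

from gensim.models import Word2Vec

# Each training sample is a tokenized review.
sentences = [review.split() for review in final['Text'].values]

# vector_size is the dimensionality d of the dense word vectors.
w2v_model = Word2Vec(sentences, vector_size=300, min_count=5, workers=4)

# Nearest neighbours in the embedding space are semantically similar
# (the query word must appear at least min_count times in the corpus).
print(w2v_model.wv.most_similar('tasty'))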

* Converting a whole sentence into a d-dimensional vector with the average word2vec approach:
To build a sentence vector, we apply word2vec to each word of the sentence, add the word vectors up, and divide by the number of words, i.e. take their average.
It works reasonably well in practice, but it is not perfect.
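A sketch of average word2vec, reusing the w2v_model trained above:

import numpy as np

def avg_word2vec(tokens, model):
    # Average the vectors of the tokens that are in the vocabulary.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

sent_vectors = [avg_word2vec(review.split(), w2v_model)
                for review in final['Text'].values]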
Bag of words codes:
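The original notes show this code as screenshots; a minimal sketch of the plain bag-of-words step is:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

print(type(final_counts))        # scipy sparse matrix
print(final_counts.get_shape())  # (n_reviews, vocabulary size)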
