Amazon Food Review Notes

The document describes a project to classify Amazon food reviews into positive and negative categories. It discusses preprocessing the review data by removing duplicates, converting ratings to positive/negative labels, and converting text to word vectors. Word vectors are created using techniques like bag-of-words, term frequency-inverse document frequency (TF-IDF), and word2vec to represent text as numerical features for machine learning models that classify review sentiment.


AMAZON FOOD REVIEWS:

The Amazon food reviews project classifies customer reviews as positive or negative.
The aim of the project is to build a system that treats ratings of 4 and 5 as positive (the best), a rating of 3 as moderate (neutral), and ratings of 1 and 2 as negative (the worst).

The data was collected by SNAP (the Stanford Network Analysis Project).
The attributes recorded for each review are:

1. Id
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator
6. HelpfulnessDenominator
7. Score
8. Time
9. Summary
10. Text
Converting the data into a machine learning problem:

We use the rating (Score) as the source of the class label: ratings of 4 and 5 are positive, a rating of 3 is neutral (and is dropped), and ratings of 1 and 2 are negative.
The remaining 8 attributes are available as metadata about the user and the rating, but the review Text is the primary information used to predict the sentiment.
Libraries imported: pandas (with the alias name pd), sqlite3, and the rest of the scientific Python stack.

First we create a connection to the SQLite database using sqlite3.connect.
Then we use pandas' read_sql_query with a SQL SELECT query to fetch data directly from the database through the connection object 'con', keeping every review except those with a score of 3 (neutral).
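A minimal sketch of this step, assuming the Kaggle release of the dataset ('database.sqlite' containing a 'Reviews' table; both names are assumptions, adjust them to your copy):

import sqlite3
import pandas as pd

# Connect to the SQLite database (file name assumed).
con = sqlite3.connect('database.sqlite')

# Fetch all reviews except the neutral ones (Score = 3).
filtered_data = pd.read_sql_query("""
    SELECT * FROM Reviews
    WHERE Score != 3
""", con)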

Then we define a function to classify a rating as positive or negative: if the rating is less than 3 we classify it as negative, and if it is greater than 3 we classify it as positive.
First we fetch the Score column from the filtered data table and store it in a series of actual scores.
Next we apply Python's 'map' function to that series to convert each score into 'positive' or 'negative'.
Then we replace the Score column of the filtered table with these labels, using three lines of code as shown in the sketch below, and print the data to confirm that the numeric scores have been replaced by 'positive' and 'negative'.
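A minimal sketch of the labelling step (column names follow the dataset schema listed above):

def partition(x):
    # Ratings above 3 are positive, below 3 negative
    # (neutral 3s were already excluded by the SQL query).
    return 'positive' if x > 3 else 'negative'

actual_score = filtered_data['Score']
positive_negative = actual_score.map(partition)
filtered_data['Score'] = positive_negative

print(filtered_data.head())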
Time is stored in the form of a Unix timestamp.
Now we clean the data. Removal of duplicate data:

To inspect the duplicates, we run a SQL query on the Reviews table that selects reviews sorted by ProductId.
We then notice that the same user appears to enter two reviews at the same timestamp, which is impossible. From this we learn that different varieties of the same product all receive identical copies of the same review.
So we need to deduplicate the data.
First we sort the data by the ProductId column using quicksort and store the result in a dataframe called sorted_data.
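A sketch of the sorting step:

# Sort by ProductId so duplicate reviews of product variants sit together.
sorted_data = filtered_data.sort_values(
    'ProductId', axis=0, ascending=True, kind='quicksort')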

Now we drop the duplicates: we treat rows with the same UserId, ProfileName, Time, and Text as duplicates, keep the first occurrence, and remove the rest using pandas' drop_duplicates function, as shown below.
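A sketch of the pandas deduplication call:

# Rows identical in these four columns are copies of the same review.
final = sorted_data.drop_duplicates(
    subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first')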

Another anomaly in the data is rows where HelpfulnessNumerator > HelpfulnessDenominator. This should be impossible, since HelpfulnessNumerator is the number of people who said the review was helpful, and HelpfulnessDenominator is the number who said yes plus the number who said no. So we remove those rows:
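A sketch of the filter:

# Keep only rows where the helpfulness counts are consistent.
final = final[final['HelpfulnessNumerator'] <= final['HelpfulnessDenominator']]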
Now we convert text to vectors:

We place all the positive and negative reviews in the same geometric space and try to find a separating hyperplane such that points on one side of the plane are positive reviews and points on the other side are negative reviews.
So we first convert each review text into a d-dimensional vector and then find a plane that separates them.
But first we need to work out how to convert a text into a d-dimensional vector.
Let r1, r2, r3 be three reviews. If similarity(r1, r2) > similarity(r1, r3), then the distance between the vectors for r1 and r2 must be smaller than the distance between the vectors for r1 and r3.
Strategies for converting text to vectors:

1. Bag of words:
In this technique we build a dictionary (a set) of all the unique words occurring in the text.
We call this set the corpus, and each review a document.
2. Constructing the vector:

Bag of words creates a d-dimensional vector for each review: one slot per unique word in the corpus, where the number stored in a slot is the count of how many times that word occurs in the review.
Because any single review uses only a tiny fraction of the corpus, most slots are zero; such a vector is called a sparse vector. Each word represents a different dimension.
The more similar the word frequencies of two reviews, the closer their vectors are to each other.
In the worked example, the distance between the two review vectors comes out as the square root of 3, because the reviews differ in only three word counts. Even though the distance could have been much larger, we get a small value, which says the vectors are very close; yet the English meanings of the two reviews are completely opposite.
A variant that stores only 0 or 1 in each slot (did the word occur at all?) is known as the binary bag of words.
The Euclidean distance between two binary bag-of-words vectors is roughly the square root of the number of differing words.
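A small illustration of the binary bag of words and the distance claim, using two made-up reviews (the sentences are illustrative, not taken from the dataset):

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

docs = ['this pasta is very tasty and affordable',
        'this pasta is not very tasty and is affordable']

# binary=True stores 1 if a word occurs at all, regardless of count.
vect = CountVectorizer(binary=True)
X = vect.fit_transform(docs).toarray()

# The reviews differ in a single word ('not'), so the Euclidean
# distance is sqrt(1) = 1, even though the meanings are opposite.
print(np.linalg.norm(X[0] - X[1]))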
Stop-word removal, tokenization, stemming, lemmatization:

1. Stop words:
Stop words are words that carry little meaning on their own. In some cases, however, they can be highly meaningful (removing 'not' flips the sentiment), so they must be removed with care. Once stop words are removed the vector becomes smaller, more efficient, and more meaningful.
We use standard NLP libraries for this step.

2. Case conversion: all text is converted to lower case so that differently cased forms of a word map to the same dimension.
3. Stemming:
Stemming maps related words like 'tasty', 'tasteful', and 'taste' to a single root form.
The Porter stemmer and the Snowball stemmer are the usual algorithms here, and the Snowball stemmer is the more effective of the two.

4. Tokenization and lemmatization: tokenization breaks a sentence up into words, while lemmatization reduces each word to its dictionary root form (its lemma). A sketch of these preprocessing steps follows.
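A minimal sketch of stop-word removal, lower-casing, and stemming with NLTK (the example sentence is made up):

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def clean_text(text):
    # Lower-case, drop stop words, and stem what remains.
    words = text.lower().split()
    return ' '.join(stemmer.stem(w) for w in words if w not in stop_words)

print(clean_text('These snacks were really tasty and tasteful'))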

Demerits of bag of words:

1. It does not use the semantic meaning of words (there is no guarantee that words with similar meanings map to nearby dimensions).

Unigrams, bigrams, n-grams:
Unigrams: in the unigram model, each single word is a unique dimension.
Bigrams: in the bigram model, each pair of consecutive words is a unique dimension of the d-dimensional vector.
Trigrams: each run of three consecutive words is a unique dimension of the d-dimensional vector.

We use n-grams because the plain bag-of-words technique deletes the sequence information; n-grams are able to preserve some of that local word order.
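A sketch of an n-gram bag of words with scikit-learn (ngram_range=(1, 2) keeps both unigrams and bigrams; 'final' is the cleaned dataframe from the deduplication step):

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(1, 2))
final_bigram_counts = count_vect.fit_transform(final['Text'].values)

print(final_bigram_counts.get_shape())  # (n_reviews, n_unigrams + n_bigrams)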
Term frequency and inverse document frequency (TF-IDF):

Term frequency is defined as
TF(Wi, Rj) = (number of times Wi occurs in Rj) / (total number of words in Rj),
i.e. how often the word Wi occurs in the document Rj.

Inverse document frequency is defined as
IDF(Wi, Dc) = log(N / ni),
where Wi is the word, Dc is the document corpus, N is the total number of documents in Dc, and ni is the number of documents containing Wi.
Two properties follow:
1. IDF >= 0
2. ni <= N

Applying TF and IDF together, each cell of the vector stores TF(Wi, Rj) * IDF(Wi, Dc):
more importance is given to words that are rare in the corpus Dc, and more importance is given to a word that is frequent within a given review.
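A sketch with scikit-learn's TfidfVectorizer (note that scikit-learn's exact TF-IDF formula differs slightly from the definition above, e.g. it smooths the IDF by default):

from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf_vect = TfidfVectorizer()
final_tf_idf = tf_idf_vect.fit_transform(final['Text'].values)

# Each row is a review; each column is a word weighted by TF * IDF.
print(final_tf_idf.get_shape())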

Why do we use log in IDF?

The answer comes from Zipf's law: word frequencies follow a power-law distribution, so a handful of words occur enormously more often than the rest. Without the log, the raw ratio N/ni for rare words would be huge and would completely dominate the TF term; taking the log compresses this range so that TF and IDF stay on comparable scales.
WORD2VEC: Deep learning concept

* Word2vec is a deep learning model (available, for example, in TensorFlow) that represents each word as a dense d-dimensional vector rather than a sparse one (most of the dimension values are not zero).
* Consider the three words 'tasty', 'delicious', and 'baseball'. Since 'tasty' and 'delicious' mean nearly the same thing, we expect the distance between their vectors to be minimal, and the distance to 'baseball' to be the largest. That is exactly what happens with word2vec embeddings.
* In other words, vectors that are close together are semantically similar in English.
If we take the vector for 'man' as one point and 'woman' as another point in 300-dimensional space, the vector joining them is parallel to the vector joining 'king' and 'queen'. This means word2vec somehow captures the gender relationship as a direction in high-dimensional space.
The same holds for other relations, such as verb tenses and country-capital pairs.
The widely used pretrained word2vec model was trained on the Google News corpus and represents each word as a 300-dimensional vector with high accuracy.
This word2vec model can likewise be applied to our review system.
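A minimal sketch of training word2vec on the review text with gensim (parameter names follow gensim 4.x; 'final' is the cleaned dataframe, and training our own model is an alternative to loading the pretrained Google News vectors):

from gensim.models import Word2Vec

# Each training sample is a tokenized review.
sentences = [review.split() for review in final['Text'].values]

# vector_size is the dimensionality d of the dense word vectors.
w2v_model = Word2Vec(sentences, vector_size=300, min_count=5, workers=4)

# Nearest neighbours in the embedding space are semantically similar
# (the query word must appear at least min_count times in the corpus).
print(w2v_model.wv.most_similar('tasty'))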

* Converting a whole sentence into a d-dimensional vector with the average word2vec approach:
To build a sentence vector, we apply word2vec to each word of the sentence, add the word vectors up, and divide by the number of words, i.e. take their average.
It works reasonably well in practice, but it is not perfect.
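A sketch of average word2vec, reusing the w2v_model trained above:

import numpy as np

def avg_word2vec(tokens, model):
    # Average the vectors of the tokens that are in the vocabulary.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

sent_vectors = [avg_word2vec(review.split(), w2v_model)
                for review in final['Text'].values]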
Bag of words codes:
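The original notes show this code as screenshots; a minimal sketch of the plain bag-of-words step is:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

print(type(final_counts))        # scipy sparse matrix
print(final_counts.get_shape())  # (n_reviews, vocabulary size)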
