Classify Movie Reviews with Recurrent Neural Network (RNN) Model

ICS 661 Advanced AI - Assignment 2

Overview

This project uses a dataset of movie reviews labeled with binary sentiment polarity (positive or negative), intended as a benchmark for sentiment classification. This document describes the structure of the dataset and the tasks to complete with it.

Dataset

The core dataset consists of 50,000 movie reviews, split equally into 25,000 reviews for training and 25,000 for testing. The label distribution is balanced, with 25,000 positive and 25,000 negative reviews.

To reduce correlation in ratings for the same movie, no more than 30 reviews are included for any single movie. Additionally, the training and test sets contain reviews from different sets of movies, ensuring that performance is not skewed by memorizing movie-specific terms linked to their sentiment labels.

A negative review is defined as having a score of 4 or less out of 10.
A positive review is defined as having a score of 7 or more out of 10.
Reviews with neutral ratings (scores 5 or 6) are excluded from the dataset.
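
The rating thresholds above can be written as a small helper. This is only a minimal sketch: the function name `label_from_rating`, the 1/0 label encoding, and the use of `None` for excluded neutral ratings are assumptions made for illustration and are not part of the dataset itself.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[int]:
    """Map a 1-10 rating to a binary label: 1 = positive, 0 = negative.

    Returns None for neutral ratings (5 or 6), which the dataset excludes.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    return None  # neutral rating, not present in the dataset
```

Because the reviews are already sorted into pos/ and neg/ directories, a helper like this is mainly useful as a sanity check against the rating embedded in each file name.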

File Structure

The dataset is organized into two top-level directories: train/ and test/, which correspond to the training and test sets. Each of these directories contains two subdirectories:

pos/: Positive reviews
neg/: Negative reviews

Each review is stored as a text file, named according to the pattern [id]_[rating].txt, where:

[id] is a unique identifier for the review
[rating] is the original star rating of the review (on a scale of 1-10)

For example, the file test/pos/200_8.txt contains a positive review from the test set, with a unique id of 200 and a rating of 8/10. However, in this project, you only need to classify reviews as positive or negative, without predicting their specific ratings.
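
One possible way to read this layout is to walk the pos/ and neg/ directories and take the label from the directory name. The sketch below assumes the dataset has been extracted so that train/ and test/ sit under a single root directory and that the files are UTF-8 encoded; `parse_review_filename` and `load_split` are hypothetical helper names, not part of the provided notebook.

```python
import re
from pathlib import Path
from typing import List, Tuple

# File names follow the pattern [id]_[rating].txt, e.g. 200_8.txt
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(filename: str) -> Tuple[int, int]:
    """Return (review_id, rating) parsed from a name like '200_8.txt'."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected review file name: {filename!r}")
    return int(match.group(1)), int(match.group(2))


def load_split(root: str, split: str) -> List[Tuple[str, int]]:
    """Load (text, label) pairs for 'train' or 'test'; 1 = pos, 0 = neg."""
    examples = []
    for folder, label in (("pos", 1), ("neg", 0)):
        # sorted() keeps the file order deterministic across runs
        for path in sorted((Path(root) / split / folder).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples
```

For instance, `parse_review_filename("200_8.txt")` yields the id 200 and the rating 8, while `load_split(root, "train")` would return the training reviews with their binary labels.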

Project Task

The primary task is to classify each review as positive or negative based on its text content. Review text is often noisy: it can contain typos, stray punctuation, and irrelevant tokens, which makes it more challenging to prepare for a model than image data.

Text Preprocessing:
- Clean the dataset by removing stop words and punctuation and by normalizing the text. You may refer to the following guide for text cleaning: Cleaning Text for Machine Learning.
Vocabulary Extraction:
- After cleaning the data, extract a vocabulary set from the provided dataset, which will serve as the input feature set for the model.
Model Training:
- Train a Recurrent Neural Network (RNN) model using the provided training set to classify the reviews as positive or negative. You do not need to infer the exact ratings, only the binary sentiment (positive vs negative).
- Your goal is to achieve at least 75% accuracy on the test set using the RNN model.
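
The preprocessing and vocabulary steps above could be sketched as follows, using only the Python standard library. Everything here is an illustrative assumption rather than the required approach: the tiny `STOP_WORDS` set, the `<pad>` and `<unk>` tokens, the removal of `<br />` tags, padding at the end of each sequence, and the helper names `clean_text`, `build_vocab`, and `encode`.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable, List

# Small illustrative stop-word list; a real project would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in",
              "is", "it", "this", "that", "was"}

PAD, UNK = "<pad>", "<unk>"  # reserved for padding and out-of-vocabulary words

# Translation table that turns every punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> List[str]:
    """Lowercase, drop <br /> tags and punctuation, then remove stop words."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)  # assumed: reviews may contain HTML breaks
    text = text.translate(PUNCT_TO_SPACE)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def build_vocab(token_lists: Iterable[List[str]], min_count: int = 1) -> Dict[str, int]:
    """Map each token seen at least min_count times to an integer id."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {PAD: 0, UNK: 1}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert tokens to ids, truncating or padding to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

The vocabulary should be built from the training reviews only, so that test-set words the model has never seen map to `<unk>`. The fixed-length id sequences produced by `encode` are the kind of input an embedding layer in front of an RNN typically expects.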
