This project implements a deep learning model to detect spam questions on Quora using a bidirectional LSTM architecture.
- `spam_filter_quora.ipynb`: Main notebook containing the model implementation
- `requirements.txt`: List of Python dependencies
- `.gitignore`: Git ignore rules
- `README.md`: Project documentation
- Create a virtual environment:

  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download GloVe embeddings: get `glove.6B.300d.txt` from the Stanford NLP website.
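Once downloaded, the GloVe file can be parsed into a word-to-vector map for building the embedding matrix. A minimal sketch (the `load_glove` helper and its `vocab` filter are illustrative, not the notebook's exact code):

```python
import numpy as np

def load_glove(path, vocab=None):
    """Parse a GloVe text file into {word: vector}.

    Each line is a word followed by its space-separated float components.
    If `vocab` is given, only words in it are kept (saves memory).
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if vocab is None or word in vocab:
                embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings
```

Filtering by the tokenizer's vocabulary is worthwhile here: `glove.6B.300d.txt` holds 400k words, but the model only uses the 30,000 most frequent.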
- Bidirectional LSTM with GloVe embeddings (300d)
- Multiple LSTM layers with dropout (0.5)
- Dense layers for classification
- Global Average Pooling
- Adam optimizer with learning rate scheduling
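The bullets above could be assembled in Keras roughly as follows. This is a sketch of the described architecture, not the notebook's exact code; in practice the `Embedding` layer would be initialised with the GloVe matrix, and a `ReduceLROnPlateau` callback would supply the learning-rate scheduling:

```python
from tensorflow.keras import layers, models, optimizers

MAX_WORDS, MAX_LEN, EMB_DIM = 30_000, 100, 300  # hyperparameters from this README

def build_model():
    # Embedding -> stacked BiLSTM with dropout -> global average pooling -> dense head
    model = models.Sequential([
        layers.Input(shape=(MAX_LEN,)),
        layers.Embedding(MAX_WORDS, EMB_DIM),  # load GloVe weights here in practice
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.GlobalAveragePooling1D(),       # collapse the time dimension
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"), # spam probability
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```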
This repository includes:
- `train.csv`: Training dataset of Quora questions with the following columns:
  - `qid`: Unique question identifier
  - `question_text`: The actual question text
  - `target`: Binary label (0 for non-spam, 1 for spam)

Initial class distribution:

- Non-spam (0): 1,225,312
- Spam (1): 80,810

The classes were balanced with RandomOverSampler to handle this imbalance.
Classification Report (with threshold = 0.9):

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 0.98 | 0.99 | 245063 |
| 1 | 0.98 | 1.00 | 0.99 | 245062 |
| accuracy | | | 0.99 | 490125 |
| macro avg | 0.99 | 0.99 | 0.99 | 490125 |
| weighted avg | 0.99 | 0.99 | 0.99 | 490125 |
- Best F1 Score: 0.989
- Final Validation Accuracy: 98.83%
- Final Validation AUC: 0.991
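Because the model outputs a sigmoid probability, the 0.9 threshold above is applied before scoring. A sketch of how such a report can be produced with scikit-learn (the `report_at_threshold` helper is illustrative):

```python
import numpy as np
from sklearn.metrics import classification_report

def report_at_threshold(y_true, y_prob, threshold=0.9):
    """Binarize sigmoid outputs at `threshold`, then score with sklearn."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return classification_report(y_true, y_pred, digits=2)
```

Raising the threshold from the default 0.5 trades recall on the spam class for precision, i.e. fewer legitimate questions get flagged.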
- Maximum words: 30,000
- Maximum sequence length: 100
- Embedding dimension: 300
- LSTM units: 128
- Dense units: 64
- Dropout rate: 0.5
- Learning rate: 0.001 (with reduction on plateau)
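The vocabulary and sequence-length hyperparameters translate directly into the text preprocessing step. A sketch using the legacy Keras `Tokenizer` API (the `vectorize` wrapper is illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN = 30_000, 100  # hyperparameters from this README

def vectorize(train_texts):
    """Fit a word-index tokenizer and return fixed-length integer sequences."""
    tok = Tokenizer(num_words=MAX_WORDS)   # keep only the 30k most frequent words
    tok.fit_on_texts(train_texts)
    seqs = tok.texts_to_sequences(train_texts)
    return tok, pad_sequences(seqs, maxlen=MAX_LEN)  # pad/truncate to 100 tokens
```

The fitted tokenizer's `word_index` is also what the GloVe embedding matrix is keyed against: row *i* of the matrix holds the vector for the word with index *i*.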