This project detects fraudulent transactions in e-commerce and credit card data using machine learning. The workflow covers data analysis, preprocessing, model building, evaluation, and model explainability with SHAP.
- Fraud_Data.csv: contains transaction and user information with a 'class' label indicating fraud.
- IpAddress_to_Country.csv: maps IP address ranges to countries for geolocation analysis.
- creditcard.csv: the standard credit card fraud dataset with a 'Class' label.
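A minimal loading sketch, assuming the three files have been placed in the data/ directory described in the setup steps below:

```python
import pandas as pd

fraud = pd.read_csv("data/Fraud_Data.csv")
ip_to_country = pd.read_csv("data/IpAddress_to_Country.csv")
creditcard = pd.read_csv("data/creditcard.csv")

# Check the class balance up front; both labelled datasets are
# heavily imbalanced, which drives the modeling choices below
print(fraud["class"].value_counts(normalize=True))
print(creditcard["Class"].value_counts(normalize=True))
```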
fraud-detection-ecommerce-credit/
├── data/ # Raw data files
├── notebooks/ # Jupyter notebooks for analysis
├── output/ # Output files and results
├── scripts/ # (Optional) Python scripts
├── src/ # Source code for preprocessing, etc.
└── README.md # Project documentation
- Clone the repository and navigate to the project directory.
- Install dependencies (recommended: use a virtual environment):
pip install pandas numpy matplotlib seaborn scikit-learn shap missingno
# For XGBoost or LightGBM, install as needed:
pip install xgboost lightgbm
- Download the data files and place them in the data/ directory.
- Launch Jupyter Notebook:
jupyter notebook
- Open the main notebook and run the cells sequentially:
notebooks/01_data_analysis_preprocessing.ipynb
- Handle missing values (impute/drop)
- Data cleaning (remove duplicates, correct types)
- Exploratory Data Analysis (EDA)
- Merge datasets for geolocation (join each transaction's IP to a country range; see the first sketch after this list)
- Feature engineering (transaction frequency, time-based features; second sketch after this list)
- Data transformation (class-imbalance handling, scaling, encoding)
- Train-test split
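The geolocation merge is a range join: each transaction's IP must fall between a row's lower and upper bound in IpAddress_to_Country.csv. A minimal sketch using pandas.merge_asof; the column names (ip_address, lower_bound_ip_address, upper_bound_ip_address, country) are assumptions based on the common version of this dataset:

```python
import pandas as pd

fraud = pd.read_csv("data/Fraud_Data.csv")
ip_map = pd.read_csv("data/IpAddress_to_Country.csv")

# Cast IPs to a common integer type; merge_asof requires sorted keys
fraud["ip_address"] = fraud["ip_address"].astype("int64")
ip_map["lower_bound_ip_address"] = ip_map["lower_bound_ip_address"].astype("int64")
fraud = fraud.sort_values("ip_address")
ip_map = ip_map.sort_values("lower_bound_ip_address")

# merge_asof matches each IP to the closest lower bound at or below it
merged = pd.merge_asof(fraud, ip_map,
                       left_on="ip_address", right_on="lower_bound_ip_address")

# Discard matches where the IP overshoots the matched range's upper bound
out_of_range = merged["ip_address"] > merged["upper_bound_ip_address"]
merged.loc[out_of_range, "country"] = None
merged["country"] = merged["country"].fillna("Unknown")
```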
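A sketch of the feature engineering, encoding, split, and scaling steps on the merged frame. The column names (signup_time, purchase_time, device_id, purchase_value, source, browser) are again assumptions based on the common version of this dataset; class imbalance itself is handled in the modeling sketch via class weights, with resampling as an alternative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = merged  # from the previous sketch
df["signup_time"] = pd.to_datetime(df["signup_time"])
df["purchase_time"] = pd.to_datetime(df["purchase_time"])

# Time-based features
df["hour_of_day"] = df["purchase_time"].dt.hour
df["day_of_week"] = df["purchase_time"].dt.dayofweek
df["time_since_signup"] = (df["purchase_time"] - df["signup_time"]).dt.total_seconds()

# Transaction frequency: how many transactions share the same device
df["device_tx_count"] = df.groupby("device_id")["device_id"].transform("count")

# One-hot encode categoricals; split before scaling to avoid leakage
features = ["purchase_value", "hour_of_day", "day_of_week",
            "time_since_signup", "device_tx_count", "source", "browser"]
X = pd.get_dummies(df[features], columns=["source", "browser"])
y = df["class"]

# Stratify so the rare fraud class keeps its proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```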
- Model selection: Logistic Regression (baseline), Random Forest or XGBoost (ensemble)
- Model evaluation: AUC-PR, F1-score, confusion matrix (sketched after this list)
- Model comparison and justification
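A sketch of the baseline-versus-ensemble comparison using the split above. AUC-PR is computed as average precision, and class_weight='balanced' is one way to address the imbalance (resampling is another):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix, f1_score

baseline = LogisticRegression(max_iter=1000, class_weight="balanced")
baseline.fit(X_train_scaled, y_train)

ensemble = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                  random_state=42)
ensemble.fit(X_train, y_train)  # tree models do not need scaled inputs

for name, model, X_eval in [("LogisticRegression", baseline, X_test_scaled),
                            ("RandomForest", ensemble, X_test)]:
    proba = model.predict_proba(X_eval)[:, 1]
    preds = model.predict(X_eval)
    # AUC-PR (average precision) is more informative than ROC-AUC
    # when the positive class is rare
    print(name,
          "AUC-PR:", round(average_precision_score(y_test, proba), 4),
          "F1:", round(f1_score(y_test, preds), 4))
    print(confusion_matrix(y_test, preds))
```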
- Use SHAP to interpret the best model (a sketch follows this list)
- Generate summary and force plots
- Discuss key drivers of fraud
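A minimal SHAP sketch, assuming the best model is a fitted tree ensemble such as the RandomForest above (the same calls apply to XGBoost):

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(ensemble)
shap_values = explainer.shap_values(X_test)

# shap's return shape varies by version for binary classifiers;
# keep the values for the positive (fraud) class either way
if isinstance(shap_values, list):       # older shap: list of per-class arrays
    shap_values = shap_values[1]
elif shap_values.ndim == 3:             # newer shap: (rows, features, classes)
    shap_values = shap_values[:, :, 1]

# Global view: which features drive fraud predictions overall
shap.summary_plot(shap_values, X_test)

# Local view: why a single transaction was scored the way it was
base = explainer.expected_value
if isinstance(base, (list, np.ndarray)):
    base = base[1]
shap.force_plot(base, shap_values[0], X_test.iloc[0], matplotlib=True)
```

The summary plot supplies the key drivers of fraud for the discussion; force plots explain individual transactions.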
- Python 3.7+
- pandas, numpy, matplotlib, seaborn
- scikit-learn
- shap
- missingno
- xgboost or lightgbm (optional, for ensemble models)
- Ensure all data files are present in the data/ directory before running the notebook.
- For large datasets, ensure sufficient memory and processing power.
This project is for educational purposes.