
jaspereap/machine-learning-pipeline


Personal Details

| Name   | Email address     |
| ------ | ----------------- |
| Jasper | [email protected] |

Overview of submitted folder

base/
├── .github/
│   └── workflows/
│       └── github-actions.yml
├── src/
│   ├── data.py
│   ├── model.py
│   ├── evaluate.py
│   ├── main.py
│   ├── feature_engineering.py
│   └── config.yml
├── eda.ipynb
├── requirements.txt
├── README.md
└── run.sh

Instructions for executing pipeline and modifying any parameters

To adjust parameters,

  • Modify config.yml under the hyperparameters section in src/

Current Parameters:

  • RandomForestClassifier
    1. n_estimators: 200
    2. random_state: 42
    3. max_depth: 7
  • LogisticRegression:
    1. C: 0.8
    2. penalty: l2
    3. max_iter: 700
    4. solver: 'sag'
  • GradientBoostingClassifier:
    1. n_estimators: 200
    2. learning_rate: 0.2
    3. random_state: 42
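
Assembled into the config file, the section above would look roughly like this (the exact key layout of src/config.yml is an assumption; the values are the ones listed):

```yaml
hyperparameters:
  RandomForestClassifier:
    n_estimators: 200
    random_state: 42
    max_depth: 7
  LogisticRegression:
    C: 0.8
    penalty: l2
    max_iter: 700
    solver: sag
  GradientBoostingClassifier:
    n_estimators: 200
    learning_rate: 0.2
    random_state: 42
```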

To run the ML pipeline

  • Run run.sh via ./run.sh

Description of logical steps of pipeline

Outline:

  1. Data Loading
    • Load raw data from the data source into a format suitable for further processing.
  2. Basic Data Exploration
    • Initial EDA to understand the characteristics of the dataset.
    • Includes checking for missing values, basic statistics and visualizing distributions of variables.
  3. Data Preprocessing
    • Cleaning and preparing the data for modeling.
    • Includes handling missing values, encoding categorical variables, scaling numerical features, and other necessary data transformations.
  4. Feature Engineering
    • Create additional features to improve performance of the machine learning model.
    • Includes creating transformations of and interactions between existing features.
  5. In-Depth Data Exploration
    • After preprocessing and feature engineering, a more detailed EDA can be performed to understand how the new features interact with the target variable.
    • Includes using univariate/bivariate plots, classifier-based feature importance scores and correlation analysis to identify relevant features
  6. Model Selection
    • Choose a set of models suitable for the prediction problem.
    • In this case, for a multi-class classification problem, Gradient Boosting, Logistic Regression and Random Forest models were shortlisted for evaluation.
  7. Model Training and Evaluation
    • Train selected models on processed and engineered dataset.
    • Evaluate the performance of each model using metrics such as accuracy, precision, recall and F1-score, and choose the best-performing one.
    • In the script, scores for each model are printed, together with the features considered.
  8. Hyperparameter Tuning
    • Fine-tune the hyperparameters of the chosen models to further improve performance.
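
The train-and-compare core of steps 6 and 7 can be sketched with scikit-learn as follows. The synthetic dataset stands in for the real processed, feature-engineered data; the hyperparameters are the ones from config.yml:

```python
# Minimal sketch of model training and evaluation (steps 6-7).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed, feature-engineered dataset.
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.2, random_state=42),
    "Logistic Regression": LogisticRegression(
        C=0.8, penalty="l2", max_iter=700, solver="sag"),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=7, random_state=42),
}

# Fit each shortlisted model and record accuracy and weighted F1.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, pred),
                    f1_score(y_test, pred, average="weighted"))

best = max(scores, key=lambda name: scores[name][0])
print(best, scores[best])
```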

Exploratory Data Analysis (EDA)

Overview of key findings

Target Variable Distribution:

  • The distribution of 'Ticket Type' revealed a skewed representation towards Standard and Luxury ticket types.
  • Deluxe tickets are a minority of the dataset.
  • Age is the most influential feature for Ticket Type.
    • Older passengers tend to go for Luxury tickets
    • Younger passengers tend to go for Standard and Deluxe

Age Distribution:

  • Older passengers highly prefer Cabin Comfort, Cleanliness and Cabin Service
  • Younger passengers have a low preference for Cleanliness

Correlation Analysis:

  • Strong correlations were observed between features like Cruise Distance, Online Check-in and Onboard Experience and the target variable Ticket Type.

Feature Importance:

  • Age was identified as the most influential feature for prediction, followed by Onboard Experience and 'Check-in Experience'.

Choices made in pipeline based on findings

Feature Engineering:

  • New features Onboard Experience and Check-in Experience were created to capture aggregated service ratings
  • Age was derived from Date of Birth so that it could be used in the analysis
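
The Age derivation can be sketched with pandas; the column name, the reference date and the mode imputation for missing dates are assumptions about the pipeline's internals:

```python
import pandas as pd

df = pd.DataFrame({"Date of Birth": ["1960-05-01", "1990-07-15", None]})
df["Date of Birth"] = pd.to_datetime(df["Date of Birth"])

# Derive Age in whole years relative to a fixed reference date
# (the real pipeline would use the survey or sailing date).
ref = pd.Timestamp("2024-01-01")
df["Age"] = ((ref - df["Date of Birth"]).dt.days // 365).astype("Int64")

# Impute missing Age with the mode, consistent with the other features.
df["Age"] = df["Age"].fillna(df["Age"].mode()[0])
```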

Feature Selection:

  • Age was identified as the most influential feature, followed by Online Check-in, Onboard Experience and Check-in Experience

Summary of how features in the dataset are processed

| Feature | Actions |
| --- | --- |
| Ticket Type | Drop all rows with missing values |
| Date of Birth | Used with Logging to calculate Age, then impute missing values with mode |
| Source of Traffic | Encode: Direct-Company Website = 0, Email Marketing = 1, Indirect-Search Engine = 2, Social Media = 3 |
| Onboard Wifi Service | Encode to numeric ratings, impute missing values with mode |
| Onboard Dining Service | Encode to numeric ratings, impute missing values with mode |
| Cabin Comfort | Impute missing values with mode |
| Onboard Entertainment | Encode to numeric ratings, impute missing values with mode |
| Cabin Service | Impute missing values with mode |
| Onboard Service | Impute missing values with mode |
| Cleanliness | Impute missing values with mode |
| Embarkation/Disembarkation time Convenient | Impute missing values with mode |
| Ease of Online booking | Impute missing values with mode |
| Gate location | Impute missing values with mode |
| Onboard Dining Service | Impute missing values with mode |
| Online Check-in | Impute missing values with mode |
| Baggage handling | Impute missing values with mode |
| Port Check-in Service | Impute missing values with mode |
| Logging | Used with Date of Birth to calculate Age |
| Cruise Name | Correct spelling to either Blastoise or Lapras, impute missing values with mode |
| Cruise Distance | Correct invalid distances, convert all to kilometres, impute missing values with mean |
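
Two of the recurring actions above, mode imputation and the fixed encoding of Source of Traffic, can be sketched with pandas (the sample values are illustrative; the mapping is the one stated in the table):

```python
import pandas as pd

df = pd.DataFrame({
    "Source of Traffic": ["Email Marketing", "Social Media", None,
                          "Email Marketing"],
    "Cabin Comfort": [4.0, None, 3.0, 4.0],
})

# Impute missing values with the column mode, as listed in the table.
for col in ["Source of Traffic", "Cabin Comfort"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode Source of Traffic with the fixed mapping from the table.
mapping = {
    "Direct-Company Website": 0, "Email Marketing": 1,
    "Indirect-Search Engine": 2, "Social Media": 3,
}
df["Source of Traffic"] = df["Source of Traffic"].map(mapping)
```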

Explanation of choice of models for each ML task

  • This is a multi-class classification problem, so the models need to be able to handle a multi-class target variable.

Models Used:

  • Gradient Boosting Classifier

    • Gradient Boosting builds trees sequentially, each one correcting the errors of the previous ones, making it a powerful technique for multi-class classification tasks.
  • Logistic Regression

    • Logistic Regression is a widely used linear model for binary and multi-class classification.
    • Not only is it computationally efficient, it also serves as a good baseline model
  • Random Forest Classifier

    • Random Forest Classifiers build multiple decision trees and combine their outputs to make predictions.
    • They work well with categorical features and are well-suited for multi-class classification tasks
    • Also capable of providing feature importance scores, which can be useful for understanding the influence of variables in the classification process.
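
The feature importance scores mentioned for Random Forest come straight off the fitted estimator; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the processed survey features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)
clf = RandomForestClassifier(n_estimators=200, max_depth=7, random_state=42)
clf.fit(X, y)

# One impurity-based importance score per feature; they sum to 1.
for i, score in enumerate(clf.feature_importances_):
    print(f"feature_{i}: {score:.3f}")
```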

Evaluation of models developed

Metrics Considered for Evaluation:

  1. Accuracy
    • Accuracy measures the overall correctness of a classification model
    • (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
  2. Precision
    • Precision focuses on the accuracy of positive predictions.
    • True Positives / (True Positives + False Positives)
  3. Recall
    • Recall measures the proportion of actual positives that were correctly predicted by the model
    • True Positives / (True Positives + False Negatives)
  4. F1-Score
    • F1-Score is the harmonic mean of precision and recall
    • Provides a balanced measure between precision and recall, considering both false positives and false negatives.
    • 2 * (Precision * Recall) / (Precision + Recall)
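
For a multi-class target these formulas are applied per class and then averaged; with scikit-learn that looks like the following (weighted averaging shown as one reasonable choice, labels illustrative):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

# Toy multi-class labels: three ticket-type classes 0, 1, 2.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)
# Per-class precision/recall/F1, averaged with class-support weights.
prec = precision_score(y_true, y_pred, average="weighted")
rec = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")
print(acc, prec, rec, f1)
```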

Summary of Results

  • Overall, this analysis provides insights into the effectiveness of different machine learning models for predicting ticket types based on pre-cruise survey features.
  • Among the three models, the Gradient Boosting Classifier demonstrated the highest accuracy (73%), followed by the Random Forest Classifier (70%) and Logistic Regression (62%).
  • The Gradient Boosting Classifier showed balanced performance with precision and recall both around 70%. This suggests that it effectively identified true positives while minimizing false positives.
  • The Gradient Boosting Classifier had the highest F1-Score at 71%, followed by the Random Forest Classifier at 67%, and the Logistic Regression model at 60%.
  • Based on the evaluation metrics, the Gradient Boosting Classifier stands out as the top-performing model.

Conclusion:

  • Gradient Boosting Classifier is recommended for predicting ticket types.
  • Further iterative fine-tuning and feature engineering is required to improve the performance further.

| Metric | Gradient Boosting Classifier | Logistic Regression | Random Forest Classifier |
| --- | --- | --- | --- |
| Accuracy | 0.73 | 0.62 | 0.70 |
| Precision | 0.70 | 0.65 | 0.73 |
| Recall | 0.73 | 0.62 | 0.70 |
| F1-Score | 0.71 | 0.60 | 0.67 |

Other considerations for deploying the models developed

Model Serving Medium

  • Consider a suitable platform to serve the models, such as a cloud-based platform like AWS SageMaker or Google AI Platform, or an open-source solution like FastAPI.
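
Whichever platform is chosen, the trained model has to be serialized once and reloaded inside the serving process; a minimal pickle sketch on stand-in data (joblib is the more common choice for scikit-learn models):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in training data and a small model for illustration.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)

# Serialize once after training; the serving process only ever loads.
blob = pickle.dumps(model)
served = pickle.loads(blob)
print(served.predict(X[:5]))
```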

API Integration

  • Develop a user-friendly interface for end-users to interact with the deployed models. Ensure that it aligns with their needs and integrates well with their existing systems.

About

Data analysis, feature engineering, model training and evaluation
