ESTHER AWINO - SCT213-C002-0089/2023
SECOND YEAR, SECOND SEMESTER, BSC. DATA SCIENCE AND ANALYTICS
APRIL 2025
MACHINE LEARNING COMPREHENSIVE ASSIGNMENT 1: CLASSIFICATION
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Step 1 & 2: Load and describe dataset
url = ('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/'
       'CognitiveClass/ML0120ENv3/Dataset/ML0101EN_EDX_skill_up/cbb.csv')
df = pd.read_csv(url)
print("Dataset Shape:", df.shape)
print("\nFirst five rows:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
# Step 3: Data Preprocessing and Visualization
# Handle missing values (if any)
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('None')
        else:
            df[col] = df[col].fillna(df[col].median())
# Encode 'CONF' categorical column
le = LabelEncoder()
df['CONF'] = le.fit_transform(df['CONF'])
# Target Variable: Did the team reach POSTSEASON (1) or not (0)
y = df['POSTSEASON'].apply(lambda x: 0 if x == 'None' else 1)
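# (Added check) Postseason teams are a minority class, so raw accuracy can
# look inflated; inspecting the class balance puts the later reports in context.
print("\nTarget class distribution:")
print(y.value_counts())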
# Feature selection
features = ['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD',
            'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D',
            'ADJ_T', 'WAB', 'CONF']
X = df[features]
# Step 4: Data Visualization
# Create correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = df[['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR',
                         'TORD']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Key Features')
plt.tight_layout()
plt.show()
# Distribution of offensive and defensive efficiency
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['ADJOE'], kde=True)
plt.title('Distribution of Adjusted Offensive Efficiency')
plt.subplot(1, 2, 2)
sns.histplot(df['ADJDE'], kde=True)
plt.title('Distribution of Adjusted Defensive Efficiency')
plt.tight_layout()
plt.show()
print("Feature distributions and correlations visualized above.")
# Step 5: Normalization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 6: Training and Validation
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
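# (Possible refinement, not used here) stratify=y would preserve the class
# ratio of the imbalanced postseason label in both splits:
# X_train, X_test, y_train, y_test = train_test_split(
#     X_scaled, y, test_size=0.2, random_state=42, stratify=y)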
models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}
trained_models = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"{name} model trained successfully.")
# Step 7: Model Evaluation
results = {}
plt.figure(figsize=(20,5))
for i, (name, model) in enumerate(trained_models.items(), 1):
    plt.subplot(1, 4, i)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc
plt.tight_layout()
plt.show()
for name, model in trained_models.items():
    print(f"Classification Report for {name}:")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
# Step 8: Comparative Analysis
print("\nComparative Accuracy:")
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
print(results_df.sort_values(by='Accuracy', ascending=False))
# Conclusion
best_model = results_df.sort_values(by='Accuracy', ascending=False).iloc[0]
print(f"\nBest Performing Model: {best_model['Model']} with Accuracy: {best_model['Accuracy']:.2f}")
Results and Key Insights
The dataset contained missing values; these were handled by replacing them with the string 'None' (for categorical columns) or the column median (for numerical columns), keeping the dataset interpretable while preserving its central tendency.
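An equivalent, more pipeline-friendly alternative (not the approach used in the code above) would be scikit-learn's SimpleImputer; a minimal sketch, assuming the same df:

from sklearn.impute import SimpleImputer

# Median imputation for numeric columns, constant 'None' for object columns.
num_cols = df.select_dtypes(include='number').columns
obj_cols = df.select_dtypes(include='object').columns
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
df[obj_cols] = SimpleImputer(strategy='constant', fill_value='None').fit_transform(df[obj_cols])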
Preprocessing:
Categorical Encoding: The 'CONF' column was label-encoded (a one-hot alternative is sketched below).
Target Variable: The 'POSTSEASON' column was transformed into binary labels (1 for a postseason appearance, 0 otherwise).
Feature Selection: Key features such as 'ADJOE', 'ADJDE', and 'BARTHAG' were extracted for analysis.
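Label encoding assigns 'CONF' an arbitrary numeric order, which distance-based models such as KNN may treat as meaningful. A minimal sketch of the one-hot alternative mentioned above (an illustration, not what the code above does):

# One-hot encode the conference column instead of label encoding it.
X_onehot = pd.get_dummies(df[features], columns=['CONF'], prefix='CONF')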
Visualization:
From the Correlation Heatmap:
Strong Positive Correlations:
BARTHAG correlates positively with ADJOE and EFG_O.
EFG_O and ADJOE are positively linked, showing that better effective field goal percentages improve offensive efficiency.
Negative Correlations:
ADJDE correlates negatively with offensive metrics (lower values indicate better defense).
TOR correlates negatively with offensive metrics, meaning higher turnover rates reduce offensive efficiency.
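To put rough numbers on these relationships against the target itself, one could correlate each numeric feature with the binary postseason label. A minimal sketch, assuming the df, features, and y defined in the code above:

# Correlation of each numeric feature with the 0/1 postseason label.
numeric_feats = [f for f in features if f != 'CONF']
print(df[numeric_feats].corrwith(y).sort_values(ascending=False))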
Distribution plots:
Offensive Efficiency (ADJOE): approximately normal, indicating most teams hover near the league average with few extremes.
Defensive Efficiency (ADJDE): also roughly normal but with a tighter spread, suggesting less variation across teams.
Normalization: StandardScaler was used to standardize the features before training.
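StandardScaler standardizes each column to zero mean and unit variance, z = (x - mean) / std. A quick sanity check on the scaled matrix, assuming X_scaled from the code above:

# Each standardized column should have mean ~0 and standard deviation ~1.
print("Scaled means:", np.round(X_scaled.mean(axis=0), 3))
print("Scaled stds:", np.round(X_scaled.std(axis=0), 3))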
Model training and validation:
Four classifiers were trained on an 80/20 train/test split: KNN, Decision Tree, SVM, and Logistic Regression. Their classification reports are shown below.
Classification Report for KNN:
              precision    recall  f1-score   support

           0       0.91      0.96      0.94       223
           1       0.81      0.66      0.73        59

    accuracy                           0.90       282
   macro avg       0.86      0.81      0.83       282
weighted avg       0.89      0.90      0.89       282

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.92      0.92      0.92       223
           1       0.71      0.71      0.71        59

    accuracy                           0.88       282
   macro avg       0.82      0.82      0.82       282
weighted avg       0.88      0.88      0.88       282

Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       223
           1       0.88      0.71      0.79        59

    accuracy                           0.92       282
   macro avg       0.90      0.84      0.87       282
weighted avg       0.92      0.92      0.92       282

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       223
           1       0.88      0.78      0.83        59

    accuracy                           0.93       282
   macro avg       0.91      0.88      0.89       282
weighted avg       0.93      0.93      0.93       282
Comparative Accuracy:
                 Model  Accuracy
3  Logistic Regression  0.932624
2                  SVM  0.918440
0                  KNN  0.897163
1        Decision Tree  0.879433
Best Performing Model: Logistic Regression with Accuracy: 0.93
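A single train/test split can be sensitive to the random seed, so as a possible extension (not part of the output above), k-fold cross-validation would give a more robust comparison. A minimal sketch, assuming the models dict and scaled data from the code above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each model on the full scaled dataset.
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")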