ESTHER AWINO - SCT213-C002-0089/2023
SECOND YEAR, SECOND SEMESTER, BSC. DATA SCIENCE AND ANALYTICS
APRIL 2025
MACHINE LEARNING COMPREHENSIVE ASSIGNMENT 1: CLASSIFICATION
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Step 1 & 2: Load and describe dataset
url = ('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/'
       'CognitiveClass/ML0120ENv3/Dataset/ML0101EN_EDX_skill_up/cbb.csv')
df = pd.read_csv(url)
print("Dataset Shape:", df.shape)
print("\nFirst five rows:")
print(df.head())
print("\nData Info:")
print(df.info())
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
# Step 3: Data Preprocessing and Visualization
# Handle missing values (if any)
for col in df.columns:
    if df[col].isnull().sum() > 0:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna('None')
        else:
            df[col] = df[col].fillna(df[col].median())
# Encode 'CONF' categorical column
le = LabelEncoder()
df['CONF'] = le.fit_transform(df['CONF'])
# Target Variable: Did the team reach POSTSEASON (1) or not (0)
y = df['POSTSEASON'].apply(lambda x: 0 if x == 'None' else 1)
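# (Added check) Postseason teams are a minority class, so raw accuracy can
# look inflated; inspecting the class balance puts the later reports in context.
print("\nTarget class distribution:")
print(y.value_counts())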
# Feature selection
features = ['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR', 'TORD',
            'ORB', 'DRB', 'FTR', 'FTRD', '2P_O', '2P_D', '3P_O', '3P_D',
            'ADJ_T', 'WAB', 'CONF']
X = df[features]
# Step 4: Data Visualization
# Create correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = df[['ADJOE', 'ADJDE', 'BARTHAG', 'EFG_O', 'EFG_D', 'TOR',
                         'TORD']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Key Features')
plt.tight_layout()
plt.show()
# Distribution of offensive and defensive efficiency
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['ADJOE'], kde=True)
plt.title('Distribution of Adjusted Offensive Efficiency')
plt.subplot(1, 2, 2)
sns.histplot(df['ADJDE'], kde=True)
plt.title('Distribution of Adjusted Defensive Efficiency')
plt.tight_layout()
plt.show()
print("Feature distributions and correlations visualized above.")
# Step 5: Normalization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 6: Training and Validation
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
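# (Possible refinement, not used here) stratify=y would preserve the class
# ratio of the imbalanced postseason label in both splits:
# X_train, X_test, y_train, y_test = train_test_split(
#     X_scaled, y, test_size=0.2, random_state=42, stratify=y)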
models = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42)
}
trained_models = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    trained_models[name] = model
    print(f"{name} model trained successfully.")
# Step 7: Model Evaluation
results = {}
plt.figure(figsize=(20,5))
for i, (name, model) in enumerate(trained_models.items(), 1):
    plt.subplot(1, 4, i)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'{name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc
plt.tight_layout()
plt.show()
for name, model in trained_models.items():
    print(f"Classification Report for {name}:")
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
# Step 8: Comparative Analysis
print("\nComparative Accuracy:")
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])
print(results_df.sort_values(by='Accuracy', ascending=False))
# Conclusion
best_model = results_df.sort_values(by='Accuracy', ascending=False).iloc[0]
print(f"\nBest Performing Model: {best_model['Model']} with Accuracy: {best_model['Accuracy']:.2f}")
Results and Key Insights
The dataset contained missing values; these were handled by replacing them with the string 'None' (for categorical columns) or the column median (for numerical columns), keeping the dataset interpretable while preserving its central tendency.
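An equivalent, more pipeline-friendly alternative (not the approach used in the code above) would be scikit-learn's SimpleImputer; a minimal sketch, assuming the same df:

from sklearn.impute import SimpleImputer

# Median imputation for numeric columns, constant 'None' for object columns.
num_cols = df.select_dtypes(include='number').columns
obj_cols = df.select_dtypes(include='object').columns
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
df[obj_cols] = SimpleImputer(strategy='constant', fill_value='None').fit_transform(df[obj_cols])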
Preprocessing:
Categorical Encoding: The 'CONF' column was label-encoded (a one-hot alternative is sketched below).
Target Variable: The 'POSTSEASON' column was transformed into binary labels (1 for a postseason appearance, 0 otherwise).
Feature Selection: Key features such as 'ADJOE', 'ADJDE', and 'BARTHAG' were extracted for analysis.
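Label encoding assigns 'CONF' an arbitrary numeric order, which distance-based models such as KNN may treat as meaningful. A minimal sketch of the one-hot alternative mentioned above (an illustration, not what the code above does):

# One-hot encode the conference column instead of label encoding it.
X_onehot = pd.get_dummies(df[features], columns=['CONF'], prefix='CONF')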
Visualization:
From the Correlation Heatmap:
Strong Positive Correlations:
BARTHAG correlates positively with ADJOE and EFG_O.
EFG_O and ADJOE are positively linked, showing that better effective field goal percentages improve offensive efficiency.
Negative Correlations:
ADJDE correlates negatively with offensive metrics (lower values indicate better defense).
TOR correlates negatively with offensive metrics, meaning higher turnover rates reduce offensive efficiency.
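To put rough numbers on these relationships against the target itself, one could correlate each numeric feature with the binary postseason label. A minimal sketch, assuming the df, features, and y defined in the code above:

# Correlation of each numeric feature with the 0/1 postseason label.
numeric_feats = [f for f in features if f != 'CONF']
print(df[numeric_feats].corrwith(y).sort_values(ascending=False))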
Distribution plots:
Offensive Efficiency (ADJOE): approximately normal, indicating most teams hover near the league average with few extremes.
Defensive Efficiency (ADJDE): also roughly normal but with a tighter spread, suggesting less variation across teams.
Normalization: StandardScaler was used to standardize the features before training.
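StandardScaler standardizes each column to zero mean and unit variance, z = (x - mean) / std. A quick sanity check on the scaled matrix, assuming X_scaled from the code above:

# Each standardized column should have mean ~0 and standard deviation ~1.
print("Scaled means:", np.round(X_scaled.mean(axis=0), 3))
print("Scaled stds:", np.round(X_scaled.std(axis=0), 3))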
Model training and validation:
Four classifiers were trained on an 80/20 train/test split: KNN, Decision Tree, SVM, and Logistic Regression. Their classification reports are shown below.
Classification Report for KNN:
              precision    recall  f1-score   support

           0       0.91      0.96      0.94       223
           1       0.81      0.66      0.73        59

    accuracy                           0.90       282
   macro avg       0.86      0.81      0.83       282
weighted avg       0.89      0.90      0.89       282

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.92      0.92      0.92       223
           1       0.71      0.71      0.71        59

    accuracy                           0.88       282
   macro avg       0.82      0.82      0.82       282
weighted avg       0.88      0.88      0.88       282

Classification Report for SVM:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       223
           1       0.88      0.71      0.79        59

    accuracy                           0.92       282
   macro avg       0.90      0.84      0.87       282
weighted avg       0.92      0.92      0.92       282

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96       223
           1       0.88      0.78      0.83        59

    accuracy                           0.93       282
   macro avg       0.91      0.88      0.89       282
weighted avg       0.93      0.93      0.93       282
Comparative Accuracy:
                 Model  Accuracy
3  Logistic Regression  0.932624
2                  SVM  0.918440
0                  KNN  0.897163
1        Decision Tree  0.879433
Best Performing Model: Logistic Regression with Accuracy: 0.93
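A single train/test split can be sensitive to the random seed, so as a possible extension (not part of the output above), k-fold cross-validation would give a more robust comparison. A minimal sketch, assuming the models dict and scaled data from the code above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy for each model on the full scaled dataset.
for name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")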