EC2011E Foundations of Machine Learning
Programming Assignment Report
Team Members:
Kaigala Mani Charan – B230999EC
Kamana Narendra Subbaraj – B231001EC
K Vinay – B230996EC
1. Binary Classification: Breast Cancer Wisconsin (Diagnostic) Dataset
1.1 Dataset Description
The Breast Cancer Wisconsin (Diagnostic) dataset is widely used for binary classification tasks in the
medical domain. It consists of 569 instances with 30 real-valued input features computed from
digitized images of fine needle aspirates (FNA) of breast masses. The diagnosis (target variable) has
two classes:
M = Malignant (cancerous)
B = Benign (non-cancerous)
For each of the 10 base measurements (radius, texture, perimeter, area, smoothness, compactness, concavity,
concave points, symmetry, and fractal dimension), the dataset provides three statistics, giving the 30 features:
Mean
Standard Error
Worst (largest value)
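For reference, the same dataset also ships with scikit-learn, which makes a quick check of the shape and class balance easy (a minimal sketch; the report itself loads the raw wdbc.data file in Section 1.2):

import numpy as np
from sklearn.datasets import load_breast_cancer

# Bundled copy of the WDBC dataset
wdbc = load_breast_cancer()
print(wdbc.data.shape)  # (569, 30)
for name, count in zip(wdbc.target_names, np.bincount(wdbc.target)):
    print(name, count)  # malignant 212, benign 357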
1.2 Preprocessing Steps
Data loading and preprocessing:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the raw data and assign column names
data = pd.read_csv("wdbc.data", header=None)
columns = ['ID', 'Diagnosis'] + [
    f"{feat}_{stat}" for stat in ['mean', 'se', 'worst'] for feat in [
        'radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness',
        'concavity', 'concave_points', 'symmetry', 'fractal_dimension']]
data.columns = columns

# Encode the target (M = 1, B = 0) and keep only the ten *_mean features
data.drop('ID', axis=1, inplace=True)
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
features = [col for col in data.columns if '_mean' in col]
X = data[features]
y = data['Diagnosis']

# Hold out 20% for testing, then standardize (scaler fit on training data only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
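A quick sanity check (variable names as above) confirms the split sizes and the effect of standardization:

print(X_train.shape, X_test.shape)             # (455, 10) (114, 10)
print(X_train_scaled.mean(axis=0).round(2))    # ~0 for every column
print(X_train_scaled.std(axis=0).round(2))     # ~1 for every column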
1.3 Models Implemented
Naive Bayes Classifier using GaussianNB()
K-Nearest Neighbors (KNN) with k=5 using KNeighborsClassifier()
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Naive Bayes: GaussianNB fits a per-class Gaussian to each feature,
# so it is insensitive to feature scaling and is trained on the raw features
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

# KNN is distance-based, so it is trained on the standardized features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
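The figures in Section 1.4 follow from the metric helpers imported above; a minimal sketch:

for name, y_pred in [("Naive Bayes", y_pred_nb), ("KNN (k=5)", y_pred_knn)]:
    print(name)
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))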
1.4 Evaluation Results
Naive Bayes Classifier Output:
Accuracy: 0.9474
Confusion Matrix:
[[70 1]
[ 5 38]]
Classification Report:
precision recall f1-score support
0 0.93 0.99 0.96 71
1 0.97 0.88 0.93 43
accuracy 0.95 114
macro avg 0.95 0.93 0.94 114
weighted avg 0.95 0.95 0.95 114
KNN Classifier Output (k = 5):
Accuracy: 0.9474
Confusion Matrix:
[[68 3]
[ 3 40]]
Classification Report:
precision recall f1-score support
0 0.96 0.96 0.96 71
1 0.93 0.93 0.93 43
accuracy 0.95 114
macro avg 0.94 0.94 0.94 114
weighted avg 0.95 0.95 0.95 114
2. PCA-Based Dimensionality Reduction
from sklearn.decomposition import PCA

# Refit both classifiers on PCA-reduced versions of the scaled features.
# X has only the 10 mean features, so k = 10 keeps all components.
for k in [10, 9, 8]:
    pca = PCA(n_components=k)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    nb_pca = GaussianNB()
    nb_pca.fit(X_train_pca, y_train)
    print(f"Naive Bayes Accuracy with PCA-{k}:", accuracy_score(y_test, nb_pca.predict(X_test_pca)))

    knn_pca = KNeighborsClassifier(n_neighbors=5)
    knn_pca.fit(X_train_pca, y_train)
    print(f"KNN Accuracy with PCA-{k}:", accuracy_score(y_test, knn_pca.predict(X_test_pca)))
PCA Results (k = number of principal components retained):

k     Naive Bayes Accuracy   KNN Accuracy
10    0.9123                 0.9474
9     0.9211                 0.9474
8     0.9211                 0.9474
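A standard diagnostic for choosing k is the cumulative explained variance ratio (variable names as above):

pca_full = PCA().fit(X_train_scaled)
print(pca_full.explained_variance_ratio_.cumsum().round(3))
# Shows how much of the total variance each prefix of components retains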
3. KNN Hyperparameter Tuning
import matplotlib.pyplot as plt

# Evaluate test accuracy for k = 1..15
k_values = list(range(1, 16))
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_scaled, y_train)
    accuracies.append(model.score(X_test_scaled, y_test))

plt.figure(figsize=(8, 5))
plt.plot(k_values, accuracies, marker='o')
plt.title("KNN Accuracy vs k")
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.grid()
plt.show()
[Figure: KNN Accuracy vs k]
Observation:
The highest test accuracy occurs around k = 5.
Very small k overfits (the decision boundary follows individual training points), while very large k underfits (predictions are smoothed toward the majority class).
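Selecting k by test-set accuracy risks tuning to the test split; a more robust variant is k-fold cross-validation on the training set (a sketch using GridSearchCV over the same range):

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 16))},
                    cv=5, scoring='accuracy')
grid.fit(X_train_scaled, y_train)
print(grid.best_params_, round(grid.best_score_, 4))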
4. Multi-Class Classification: Car Evaluation Dataset
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Load the Car Evaluation dataset and encode the six categorical attributes
car_data = pd.read_csv("car.data", header=None)
car_data.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
encoder = OrdinalEncoder()
X_car = encoder.fit_transform(car_data.drop('class', axis=1))
y_car = car_data['class']
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_car, y_car, test_size=0.2, random_state=42)
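One caveat: with no categories specified, OrdinalEncoder orders values alphabetically ('high' < 'low' < 'med' < 'vhigh'), which scrambles the natural order of these attributes. Tree models can still split on the codes, but an explicit ordering (a sketch, using the category values from the UCI attribute description) preserves ordinality:

encoder = OrdinalEncoder(categories=[
    ['low', 'med', 'high', 'vhigh'],  # buying
    ['low', 'med', 'high', 'vhigh'],  # maint
    ['2', '3', '4', '5more'],         # doors
    ['2', '4', 'more'],               # persons
    ['small', 'med', 'big'],          # lug_boot
    ['low', 'med', 'high'],           # safety
])
X_car = encoder.fit_transform(car_data.drop('class', axis=1))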
# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(Xc_train, yc_train)
yc_pred_dt = dt.predict(Xc_test)
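The results below come from the metric helpers imported in Section 1 (the same pattern is used for the random forest):

print("Accuracy:", accuracy_score(yc_test, yc_pred_dt))
print(classification_report(yc_test, yc_pred_dt))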
Decision Tree Results:
Accuracy: 0.9739884393063584
Classification Report:
precision recall f1-score support
acc 0.97 0.92 0.94 83
good 0.62 0.91 0.74 11
unacc 1.00 1.00 1.00 235
vgood 1.00 0.94 0.97 17
accuracy 0.97 346
macro avg 0.90 0.94 0.91 346
weighted avg 0.98 0.97 0.98 346
# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(Xc_train, yc_train)
yc_pred_rf = rf.predict(Xc_test)
Random Forest Results:
Accuracy: 0.9739884393063584
Classification Report:
precision recall f1-score support
acc 0.99 0.90 0.94 83
good 0.65 1.00 0.79 11
unacc 0.99 1.00 1.00 235
vgood 1.00 0.94 0.97 17
accuracy 0.97 346
macro avg 0.91 0.96 0.92 346
weighted avg 0.98 0.97 0.98 346
5. Conclusion
KNN and Naive Bayes reached the same test accuracy (94.74%) on the binary task, but after scaling KNN produced more balanced errors, with higher recall on the malignant class (0.93 vs 0.88).
PCA reduced dimensionality while maintaining high accuracy: KNN held 94.74% down to 8 components, while Naive Bayes dropped slightly.
k = 5 gave the best KNN accuracy on this dataset.
Decision Tree and Random Forest achieved the same overall accuracy (97.40%) on the multi-class car dataset, with Random Forest showing slightly better macro-averaged recall and F1, consistent with the variance reduction expected from ensemble learning.
The assignment highlights the importance of preprocessing, model selection, and
hyperparameter tuning in practical ML applications.
6. References
UCI Machine Learning Repository (https://archive.ics.uci.edu/)
scikit-learn Documentation (https://scikit-learn.org/)
Course Lecture Slides and Notes