SCHOOL OF TECHNOLOGY
BACHELOR OF INFORMATION SECURITY AND FORENSICS &
BACHELOR OF SOFTWARE DEVELOPMENT & BACHELOR IN INFORMATION
FORENSICS AND SECURITY
MACHINE LEARNING
JANUARY-APRIL 2023
ASSIGNMENT II
MEMBERS.
Ibrahim Hussein 19/05592 BISF
Moses Kipngeno 19/05914 BISF
Everlyne Nelius Irungu 19/05463 BISF
Alice Njeri Kuria 19/05790 BISF
Collins Njoroge 19/02573 BISF
ACTIVITY
1. Describe the Support Vector Machine algorithm.
Support Vector Machine (SVM) is a powerful machine learning algorithm used for
classification and regression tasks.
It works by finding the hyperplane that best separates the data points into different
classes, possibly in a high-dimensional feature space.
The SVM algorithm works through the following steps:
i. Data preprocessing: The input data is first preprocessed to ensure that it is in a suitable
format for the Support Vector Machine. This may include scaling, normalization and other
transformations to ensure that the data is centered and the features are on similar
scales.
ii. Feature mapping: SVM maps the input data into a higher-dimensional space using a
kernel function. This makes it possible to find a hyperplane that separates data
points which are not linearly separable in the original space.
iii. Hyperplane selection: SVM then searches for the optimal hyperplane, the one that separates
the data points with the maximum margin. The margin is the distance between the
hyperplane and the closest data points from each class. The larger the margin, the more
confident the algorithm is about its classification.
iv. Support vector identification: The data points closest to the hyperplane on each side
are known as support vectors. These support vectors determine the position of the
hyperplane and are used to calculate the margin.
v. Classification: Once the optimal hyperplane is found, SVM uses it to classify new
data points based on which side of the hyperplane they fall on. If a data point falls
on the positive side of the hyperplane, it is assigned to one class, and if it falls on
the negative side, it is assigned to the other class.
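The steps above can be sketched with scikit-learn on a tiny hand-made dataset. This is only an illustration under stated assumptions: the six points, the large value of C (used to approximate a hard margin) and all variable names are hypothetical and are not part of the datasets used later in this assignment.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Step iii: fit a linear SVM; a very large C approximates a hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Distance between the two margin boundaries (w . x + b = -1 and +1).
margin_width = 2.0 / np.linalg.norm(w)

# Step iv: the training points that define the hyperplane.
support_vectors = clf.support_vectors_

# Step v: new points are classified by the sign of w . x + b.
new_points = np.array([[1.5, 1.5], [4.5, 4.5]])
scores = clf.decision_function(new_points)
predicted = clf.predict(new_points)

print("margin width:", margin_width)
print("support vectors:\n", support_vectors)
print("scores:", scores, "predicted classes:", predicted)
```

A negative score places a point on the side of class 0 and a positive score on the side of class 1, which is the "side of the hyperplane" rule described in step v.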
SVM can therefore handle both linearly and non-linearly separable data by using different
kernel functions. Kernel functions used in SVM include linear, polynomial, radial basis
function (RBF), and sigmoid.
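The effect of the kernel choice can be seen on data that is not linearly separable. The sketch below is an illustration only: it assumes a synthetic "two concentric circles" dataset generated with scikit-learn's make_circles, and the sample size, noise level and default hyperparameters are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train one SVM per kernel and record its accuracy on the held-out data.
accuracy = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    accuracy[kernel] = model.score(X_test, y_test)

for kernel, acc in accuracy.items():
    print(f"{kernel:>8}: {acc:.2f}")
```

On data of this shape a straight line cannot separate the classes, so the RBF kernel would be expected to score well above the linear kernel; the exact figures depend on the generated data and the hyperparameters.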
SVM is a powerful algorithm for classification tasks and can handle high-dimensional
datasets with complex decision boundaries.
A disadvantage of SVM is that it is not well suited to very large datasets because of its long
training time.
2. Preprocess a selected dataset
Data preprocessing is the process of preparing raw data and making it suitable for machine
learning models. It includes data cleaning, which makes the data ready to be given
to a machine learning model.
Below is a dataset containing student performance records. We apply various data preprocessing
commands to the dataset as shown below.
import pandas as pd
import numpy as np
#read csv
df_excel = pd.read_csv('StudentsPerformance.csv')
df_excel
#first look
df_excel.describe()
#calculate specific columns
df_excel['math score'].sum()
df_excel['math score'].mean()
df_excel['math score'].max()
df_excel['math score'].min()
df_excel['math score'].count()
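#calculate specific rows
df_excel['average'] = (df_excel['math score'] + df_excel['reading score']
                       + df_excel['writing score'])/3
# row-wise mean over the score columns only; text columns such as 'gender' cannot be averaged
df_excel[['math score', 'reading score', 'writing score']].mean(axis=1)
df_excel.head()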
# count
df_excel['gender'].value_counts()
# if condition
df_excel['pass/fail'] = np.where(df_excel['average'] > 70, 'Pass', 'Fail')
df_excel.head()
# multiple conditions
conditions = [
    (df_excel['average']>=90),
    (df_excel['average']>=80) & (df_excel['average']<90),
    (df_excel['average']>=70) & (df_excel['average']<80),
    (df_excel['average']>=60) & (df_excel['average']<70),
    (df_excel['average']>=50) & (df_excel['average']<60),
    (df_excel['average']<50),
]
values = ['A', 'B', 'C', 'D', 'E', 'F']
# a string default keeps the result type consistent; the implicit default of 0 is an
# integer and may not combine with string values in recent NumPy versions
df_excel['grades'] = np.select(conditions, values, default='')
df_excel.head()
# show first 5 rows
df_excel[['average', 'pass/fail', 'grades']].head()
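The description of preprocessing above mentions data cleaning. A minimal sketch of typical cleaning steps is given below. It uses a small made-up DataFrame rather than StudentsPerformance.csv, so the values, the choice of median imputation and the variable names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small in-memory stand-in for the student data (hypothetical values).
raw = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male"],
    "math score": [72, 69, np.nan, 47, 69],
    "reading score": [72, 90, 95, 57, 90],
    "writing score": [74, 88, 93, 44, 88],
})

# 1. Count the missing values in each column.
missing_per_column = raw.isna().sum()

# 2. Drop rows that are exact duplicates of an earlier row.
deduplicated = raw.drop_duplicates()

# 3. Fill missing scores with the median of their column.
score_columns = ["math score", "reading score", "writing score"]
cleaned = deduplicated.copy()
cleaned[score_columns] = cleaned[score_columns].fillna(
    cleaned[score_columns].median())

# 4. One-hot encode the categorical column so a model can use it.
encoded = pd.get_dummies(cleaned, columns=["gender"])

print(missing_per_column)
print(encoded)
```

Whether to drop, fill or keep incomplete rows depends on the dataset; median imputation is only one common option.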
3. Using an example in Python and a sample dataset build an SVM model.
# Support Vector Machine
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the datasets
datasets = pd.read_csv('Social_Network_Ads.csv')
X = datasets.iloc[:, [2,3]].values
Y = datasets.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25,
random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)
# Fitting the classifier into the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_Train, Y_Train)
# Predicting the test set results
Y_Pred = classifier.predict(X_Test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
# rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(Y_Test, Y_Pred)
print(cm)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_Set, Y_Set = X_Train, Y_Train
# Build a fine grid covering the scaled feature space
X1, X2 = np.meshgrid(
    np.arange(start = X_Set[:, 0].min() - 1, stop = X_Set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_Set[:, 1].min() - 1, stop = X_Set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by its predicted class to show the decision regions
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(Y_Set)):
    plt.scatter(X_Set[Y_Set == j, 0], X_Set[Y_Set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Support Vector Machine (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_Set, Y_Set = X_Test, Y_Test
# Build a fine grid covering the scaled feature space
X1, X2 = np.meshgrid(
    np.arange(start = X_Set[:, 0].min() - 1, stop = X_Set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_Set[:, 1].min() - 1, stop = X_Set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by its predicted class to show the decision regions
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(Y_Set)):
    plt.scatter(X_Set[Y_Set == j, 0], X_Set[Y_Set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Support Vector Machine (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
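The confusion matrix computed earlier can be summarised as a single accuracy figure. The sketch below shows one way to do this. Because Social_Network_Ads.csv is not reproduced here, it assumes a synthetic two-feature dataset from scikit-learn's make_classification as a stand-in, so the printed numbers do not describe the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two features and binary label (assumption).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Same pipeline as above: scale, fit a linear SVM, predict.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = SVC(kernel="linear", random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal of the matrix.
accuracy_from_cm = (tn + tp) / cm.sum()

print(cm)
print("accuracy:", accuracy_from_cm)
```

The value derived from the matrix should agree with sklearn.metrics.accuracy_score on the same predictions.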