SVM Notes: Unit 4
One reasonable choice for the best hyperplane is the one that represents the
largest separation, or margin, between the two classes.
So we choose the hyperplane whose distance to the nearest data point on
each side is maximized. If such a hyperplane exists, it is known as
the maximum-margin hyperplane (hard margin). So, from the figure above, we
choose L2. Now let's consider a scenario like the one shown below.
Figure: Selecting a hyperplane for data with an outlier
Here we have one blue ball lying within the region of the red balls. So how does SVM
classify this data? It's simple! The blue ball among the red ones is an
outlier of the blue class. The SVM algorithm has the characteristic of ignoring such
outliers and finding the best hyperplane that maximizes the margin; in this sense, SVM is
robust to outliers.
For this type of data, SVM finds the maximum margin as it did for the previous
data sets, and in addition it adds a penalty each time a point crosses
the margin. The margins in such cases are called soft margins. When
there is a soft margin, SVM tries to
minimize (1/margin + λ·(∑penalty)). Hinge loss is a commonly used penalty: if there are
no violations, there is no hinge loss; if there are violations, the hinge loss is
proportional to the distance of the violation.
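Concretely, since the margin width is inversely proportional to ||w||, this objective can be written in one common form (writing the class label of the ith point as t_i ∈ {+1, −1}) as:

minimize over w, b:   (1/2)·||w||^2 + λ·∑_i max(0, 1 − t_i·(w^T x_i + b))

The term max(0, 1 − t_i·(w^T x_i + b)) is the hinge loss of the ith point: it is zero when the point lies on the correct side of the margin, and it grows linearly with the distance by which the point violates the margin.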
Till now, we were talking about linearly separable data (the group of blue balls
and the group of red balls are separable by a straight line/linear boundary). What do we do
if the data are not linearly separable?
Say our data is as shown in the figure above. SVM solves this by creating a new
variable using a kernel. We call a point on the line x_i, and we create a new variable
y_i as a function of its distance from the origin o. So if we plot this, we get something
like what is shown below.
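As an illustration of this idea (a minimal sketch; the specific points and the choice of y_i = x_i^2 as the function of distance from the origin are only for illustration):

import numpy as np

# One-dimensional data that is not linearly separable:
# the "red" points surround the "blue" points on the line.
x_red = np.array([-3.0, -2.5, 2.5, 3.0])
x_blue = np.array([-1.0, -0.5, 0.5, 1.0])

# Create a new variable y_i as a function of the distance from the origin;
# here we simply use the squared distance, y_i = x_i**2.
y_red = x_red ** 2
y_blue = x_blue ** 2

# In the new (x, y) plane the two classes can be separated by the
# horizontal line y = 2, even though no single threshold on x works.
print(np.all(y_red > 2), np.all(y_blue < 2))   # True True

Plotting (x_i, y_i) for each point gives the kind of picture described above: the two classes become separable by a straight line in the new space.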
Consider a binary classification problem with two classes, labeled as +1 and -1.
We have a training dataset consisting of input feature vectors X and their
corresponding class labels Y.
The equation for the linear hyperplane can be written as:
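w^T x + b = 0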
The vector w represents the normal vector to the hyperplane, i.e. the direction
perpendicular to the hyperplane. The parameter b in the equation represents the
offset, or the distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be
calculated as:
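d_i = (w^T x_i + b) / ||w||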
where ||w|| represents the Euclidean norm of the weight vector w, i.e. the Euclidean
norm of the normal vector to the hyperplane.
For the linear SVM classifier, the predicted label ŷ for an input x is:
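ŷ = 1 if w^T x + b ≥ 0, and ŷ = 0 if w^T x + b < 0

(This uses the 0/1 label convention referred to below, where y_i = 1 for positive instances and y_i = 0 for negative instances.)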
Optimization:
● For Hard margin linear SVM classifier:
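Maximizing the margin 2/||w|| is equivalent to minimizing (1/2)·||w||^2, so the hard-margin problem is

minimize over w, b:   (1/2)·||w||^2
subject to:   t_i·(w^T x_i + b) ≥ 1   for every training instance i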
The target variable or label for the ith training instance is denoted by the symbol
t_i in this statement, with t_i = −1 for negative instances (when y_i = 0) and
t_i = 1 for positive instances (when y_i = 1). This is because we require a decision
boundary that satisfies the constraint t_i·(w^T x_i + b) ≥ 1 for every training point.
● For Soft margin linear SVM classifier:
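minimize over w, b, ζ:   (1/2)·||w||^2 + C·∑_i ζ_i
subject to:   t_i·(w^T x_i + b) ≥ 1 − ζ_i  and  ζ_i ≥ 0   for all i

Here C is the regularization parameter that trades off a wide margin against margin violations, and the slack variables ζ_i measure how far each point falls inside or beyond the margin. In practice this problem is solved through its dual formulation: the Lagrange multipliers α_i are chosen to maximize the dual objective

∑_i α_i − (1/2)·∑_i ∑_j α_i α_j t_i t_j K(x_i, x_j)   subject to   ∑_i α_i t_i = 0  and  0 ≤ α_i ≤ C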
where,
● αi is the Lagrange multiplier associated with the ith training sample.
● K(xi, xj) is the kernel function that computes the similarity between two
samples xi and xj. It allows SVM to handle nonlinear classification problems by
implicitly mapping the samples into a higher-dimensional feature space.
● The term ∑αi represents the sum of all Lagrange multipliers.
Once the dual problem has been solved and the optimal Lagrange multipliers have been
found, the SVM decision boundary can be described in terms of these multipliers and
the support vectors. The support vectors are the training samples with α_i > 0, and the
decision function is given by:
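f(x) = ∑_{i ∈ S} α_i t_i K(x_i, x) + b

where S is the set of support vectors; a new point x is assigned to class +1 if f(x) ≥ 0 and to class −1 otherwise.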
Based on the nature of the decision boundary, Support Vector Machines (SVMs)
can be divided into two main types:
● Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated,
linear SVMs are very suitable. This means that a single straight line (in 2D) or
a hyperplane (in higher dimensions) can entirely divide the data points into
their respective classes. A hyperplane that maximizes the margin between the
classes is the decision boundary.
● Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot
be separated into two classes by a straight line (in the case of 2D). By using
kernel functions, nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions into a
higher-dimensional feature space, where the data points can be linearly
separated. A linear SVM is then used in this transformed space, and the linear
boundary it finds corresponds to a nonlinear decision boundary in the original input space.
The SVM kernel is a function that takes a low-dimensional input space and
transforms it into a higher-dimensional space, i.e. it converts non-separable
problems into separable problems. It is mostly useful in non-linear separation
problems. Simply put, the kernel performs some extremely complex data
transformations and then finds the process to separate the data based on the
labels or outputs defined.
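As a minimal sketch of this in practice (the toy data set and the parameter values below are illustrative only, not taken from these notes), a non-linear SVM with an RBF kernel can be trained with scikit-learn as follows:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: one class clustered near the origin, the other on a surrounding ring,
# so no straight line in the 2D plane separates the two classes.
X_inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
X_outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(scale=0.2, size=(50, 2))
X = np.vstack([X_inner, X_outer])
y = np.array([0] * 50 + [1] * 50)

# The RBF (Gaussian) kernel implicitly maps the points into a higher-dimensional
# space where a linear separator exists; C is the soft-margin penalty parameter.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("support vectors per class:", clf.n_support_)

Choosing the kernel (RBF, polynomial, sigmoid, or a custom one) is what lets the same linear-margin machinery handle nonlinearly separable data.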
Advantages of SVM
● Effective in high-dimensional cases.
● It is memory efficient, as it uses only a subset of the training points (the
support vectors) in the decision function.
● Different kernel functions can be specified for the decision function, and it is
possible to specify custom kernels.