This work is based on examples from the course https://www.coursera.org/learn/machine-learning-calculus prepared by Luis Serrano.
Linear separation refers to data points in binary classification problems that can be separated by a linear decision boundary. If the data points can be separated by a line, linear function, or flat hyperplane, they are said to be linearly separable.
If the points of the two classes in an n-dimensional space can be separated by an (n-1)-dimensional hyperplane, they are said to be linearly separable.
For two-dimensional input data, if there is a line whose equation is

$$w_1x_1 + w_2x_2 + b = 0,\tag{1}$$

that separates all samples of one class from the other class, then the label of an observation can be determined from which side of the line it falls on. Such classification problems are called "linearly separable", i.e. separable by a linear combination of the inputs.
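For illustration, checking which side of such a line a point falls on takes only a couple of lines of NumPy; the coefficients below are arbitrary placeholders, not values from this lab.

```python
import numpy as np

# Illustrative coefficients of a separating line w1*x1 + w2*x2 + b = 0 (placeholders).
w = np.array([1.0, -1.0])
b = 0.5

def side_of_line(x):
    """Return 1 if the point x = (x1, x2) lies on the positive side of the line, else 0."""
    return int(np.dot(w, x) + b > 0)

print(side_of_line(np.array([2.0, 0.0])))  # 1 -- positive side of the line
print(side_of_line(np.array([0.0, 2.0])))  # 0 -- negative side of the line
```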
The input layer contains two nodes, $x_1$ and $x_2$, and the output layer consists of a single node that produces the prediction.
To be able to perform classification we need a nonlinear approach. This can be achieved with the sigmoid activation function, which maps large negative inputs to values near $0$, large positive inputs to values near $1$, and inputs in a small range around $0$ to intermediate values.
The sigmoid activation function is defined as

$$\sigma(z) = \frac{1}{1+e^{-z}}.\tag{2}$$
A threshold value of $0.5$ is then used to convert the activation into a class label: activations greater than $0.5$ are assigned to class $1$, and the rest to class $0$.
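A minimal NumPy sketch of this activation (the helper name is illustrative):

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))

# Large negative inputs map close to 0, large positive inputs close to 1,
# and values near 0 produce intermediate outputs around 0.5.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx [4.54e-05, 0.5, 0.99995]
```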
A single perceptron neural network with the sigmoid activation function can be expressed as:
\begin{align}
z^{(i)} &= W x^{(i)} + b,\\
a^{(i)} &= \sigma\left(z^{(i)}\right).\tag{3}
\end{align}
With the $m$ training examples stacked as columns of a matrix $X$ of shape $(2 \times m)$, the same computation can be performed for all examples at once in matrix form:
\begin{align}
Z &= W X + b,\\
A &= \sigma\left(Z\right).\tag{4}
\end{align}
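A possible vectorized implementation of equations (3)-(4), assuming $X$ has shape $(2, m)$, $W$ has shape $(1, 2)$ and $b$ is a scalar; the helper names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, W, b):
    """Compute Z = W X + b and A = sigmoid(Z) for all m examples at once.

    X -- inputs of shape (2, m)
    W -- weights of shape (1, 2)
    b -- scalar bias, broadcast over the m columns
    """
    Z = np.matmul(W, X) + b   # shape (1, m)
    A = sigmoid(Z)            # shape (1, m)
    return A
```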
When dealing with classification problems, the most commonly used cost function is the log loss. For a single training example it is

$$L\left(a^{(i)}, y^{(i)}\right) = -y^{(i)}\log\left(a^{(i)}\right) - \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right),\tag{5}$$

and the cost is its average over all training examples,

$$\mathcal{L}\left(W, b\right) = \frac{1}{m}\sum_{i=1}^{m} L\left(a^{(i)}, y^{(i)}\right),\tag{6}$$

where $a^{(i)}$ is the activation (prediction) for the $i$-th example, $y^{(i)}$ is its true label, and $m$ is the number of training examples.
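A sketch of the corresponding cost computation, assuming `A` and `Y` are arrays of shape $(1, m)$ and the function name is illustrative:

```python
import numpy as np

def compute_cost(A, Y):
    """Log loss of equations (5)-(6), averaged over the m examples."""
    m = Y.shape[1]
    logprobs = -Y * np.log(A) - (1 - Y) * np.log(1 - A)
    return np.sum(logprobs) / m
```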
We want to minimize the cost function during training. To implement gradient descent, calculate the partial derivatives using the chain rule:
\begin{align}
\frac{\partial \mathcal{L} }{ \partial w_1 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_1^{(i)},\\
\frac{\partial \mathcal{L} }{ \partial w_2 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_2^{(i)},\\
\frac{\partial \mathcal{L} }{ \partial b } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right).\tag{7}
\end{align}
The equations above can be rewritten in matrix form:
\begin{align}
\frac{\partial \mathcal{L} }{ \partial W } &= \begin{bmatrix} \frac{\partial \mathcal{L} }{ \partial w_1 } & \frac{\partial \mathcal{L} }{ \partial w_2 }\end{bmatrix} = \frac{1}{m}\left(A - Y\right)X^T,\\
\frac{\partial \mathcal{L} }{ \partial b } &= \frac{1}{m}\left(A - Y\right)\mathbf{1},\tag{8}
\end{align}
where $A = \begin{bmatrix} a^{(1)} & \cdots & a^{(m)} \end{bmatrix}$ and $Y = \begin{bmatrix} y^{(1)} & \cdots & y^{(m)} \end{bmatrix}$ are $1 \times m$ row vectors of activations and labels, and $\mathbf{1}$ is an $m \times 1$ vector of ones used to sum over the examples.
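Equation (8) translates almost directly into NumPy; the sketch below assumes the shapes introduced above, and the helper name is illustrative:

```python
import numpy as np

def backward_propagation(A, X, Y):
    """Gradients of the cost with respect to W and b (equation 8).

    A -- activations of shape (1, m)
    X -- inputs of shape (2, m)
    Y -- labels of shape (1, m)
    """
    m = X.shape[1]
    dZ = A - Y                                    # shape (1, m)
    dW = np.matmul(dZ, X.T) / m                   # shape (1, 2): (A - Y) X^T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m    # shape (1, 1): (A - Y) 1 / m
    return dW, db
```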
Then you can update the parameters:
\begin{align}
W &= W - \alpha \frac{\partial \mathcal{L} }{ \partial W },\\
b &= b - \alpha \frac{\partial \mathcal{L} }{ \partial b },\tag{9}
\end{align}
where $\alpha$ is the learning rate.
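A single gradient descent step from equation (9) could look like the sketch below; the learning rate value is only an example. Training then repeats forward propagation, gradient computation, and this update for a chosen number of iterations.

```python
def update_parameters(W, b, dW, db, learning_rate=1.2):
    """One gradient descent step: move W and b against their gradients."""
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b
```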
In the last step, apply the threshold to turn the activation into a class prediction:

$$\hat{y} = \begin{cases} 1 & \mbox{if } a > 0.5, \\ 0 & \mbox{otherwise.} \end{cases}\tag{10}$$
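The thresholding in equation (10) vectorizes naturally; a small sketch:

```python
import numpy as np

def predict(A, threshold=0.5):
    """Convert activations A of shape (1, m) into 0/1 class predictions (equation 10)."""
    return (A > threshold).astype(int)

print(predict(np.array([[0.1, 0.6, 0.5, 0.99]])))  # [[0 1 0 1]]
```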
As a dataset we will generate a NumPy array $X$ of shape $(2 \times m)$ containing the input features and a NumPy array $Y$ of shape $(1 \times m)$ containing the corresponding class labels.
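One possible way to generate such arrays; the sample size and the generating rule below are illustrative assumptions, not the lab's exact dataset.

```python
import numpy as np

np.random.seed(3)
m = 30                                   # number of training examples (illustrative)
X = 2 * np.random.randn(2, m)            # input features, shape (2, m)
Y = (X[0, :] + X[1, :] > 0).astype(int).reshape(1, m)  # labels, shape (1, m)

print(X.shape, Y.shape)  # (2, 30) (1, 30)
```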