CS601 Machine Learning Unit 2 Notes 1672759753
Syllabus: Linearity vs non linearity, activation functions like sigmoid, ReLU, etc., weights and
bias, loss function, gradient descent, multilayer network, back propagation, weight
initialization, training, testing, unstable gradient problem, auto encoders, batch normalization,
dropout, L1 and L2 regularization, momentum, tuning hyper parameters
Sigmoid
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s easy to
work with and has all the nice properties of activation functions: it’s non-linear, continuously
differentiable, monotonic, and has a fixed output range.
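As a small illustration of these properties, the sketch below evaluates the sigmoid and its derivative at a few points. The function names are chosen here for illustration and are not taken from any particular library.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z: float) -> float:
    """Derivative of the sigmoid, written in terms of the sigmoid's own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The output rises monotonically with z and always stays between 0 and 1.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 4), round(sigmoid_derivative(z), 4))
```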
ReLU
ReLU stands for Rectified Linear Unit. The formula is deceptively
simple: max(0,z). Despite its name and appearance, it’s not linear. It provides the same
non-linearity benefits as Sigmoid, often with better performance in deep networks, because it is
cheap to compute and its gradient does not shrink for positive inputs.
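The formula max(0,z) can be written directly in code. The sketch below also checks one way of seeing that ReLU is not linear: a linear function f would satisfy f(a) + f(b) = f(a + b), and ReLU does not. The derivative at z = 0 is set to 0 here, which is a common convention and an assumption of this sketch.

```python
def relu(z: float) -> float:
    """Rectified Linear Unit: passes positive values through, clips the rest to 0."""
    return max(0.0, z)

def relu_derivative(z: float) -> float:
    """Slope of ReLU: 1 for positive z, 0 otherwise (value at z == 0 is a convention)."""
    return 1.0 if z > 0 else 0.0

# Linearity would require relu(-1) + relu(1) == relu(-1 + 1); it does not hold.
print(relu(-1.0) + relu(1.0), relu(-1.0 + 1.0))
```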
Bias
It is an extra input to a neuron whose value is always 1, and it has its own connection weight.
This makes sure that even when all the inputs are zero (all 0’s), the neuron can still be activated.
Bias terms are additional constants attached to neurons and added to the weighted input
before the activation function is applied. Bias terms help models represent patterns that do not
necessarily pass through the origin. For example, if all your features were 0, would your output
also be zero? Is it possible there is some base value upon which your features have an effect?
Bias terms typically accompany weights and must also be learned by your model.
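A minimal sketch of the two equivalent views of bias described above: a constant added to the weighted input, or the weight on an extra input that is always 1. The weights and the bias value are made up for illustration.

```python
def weighted_input(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, before any activation function."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

weights, bias = [0.5, -0.3], 0.7  # hypothetical values

# With all features at 0, the neuron still receives the bias as its input.
as_constant = weighted_input([0.0, 0.0], weights, bias)

# Same result when the bias is treated as the weight of an extra input fixed at 1.
as_extra_input = weighted_input([0.0, 0.0, 1.0], weights + [bias], 0.0)

print(as_constant, as_extra_input)
```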
Loss Functions
A loss function, or cost function, is a wrapper around our model's predict function that tells us
“how good” the model is at making predictions for a given set of parameters. The loss function
has its own curve and its own derivatives. The slope of this curve tells us how to change our
parameters to make the model more accurate! We use the model to make predictions. We use
the cost function to update our parameters. Our cost function can take a variety of forms as
there are many different cost functions available. Popular loss functions include: MSE (L2) and
Cross-entropy Loss.
The loss function computes the error for a single training example. The cost function is the
average of the loss functions of the entire training set.
‘mse’: for mean squared error.
‘binary_crossentropy’: for binary logarithmic loss (logloss).
‘categorical_crossentropy’: for multi-class logarithmic loss (logloss).
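To make the loss/cost distinction concrete, the sketch below computes a per-example loss and then averages it over a small set to obtain the cost. The targets and predictions are made-up numbers, and the clipping constant eps is an assumption added to avoid taking log(0).

```python
import math

def squared_error(y_true, y_pred):
    """Loss for one example under mean squared error."""
    return (y_true - y_pred) ** 2

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Log loss for one example; y_pred is a probability, clipped away from 0 and 1."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

def cost(loss_fn, targets, predictions):
    """Cost: the average of the per-example losses over the whole set."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)

targets = [1.0, 0.0, 1.0]       # hypothetical labels
predictions = [0.9, 0.2, 0.6]   # hypothetical model outputs

print(cost(squared_error, targets, predictions))
print(cost(binary_cross_entropy, targets, predictions))
```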
Gradient Descent
Optimization is a big part of machine learning. Almost every machine learning algorithm has an
optimization algorithm at its core.
Gradient descent is an optimization algorithm used to find the values of parameters
(coefficients) of a function (f) that minimizes a cost function (cost).
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using
linear algebra) and must be searched for by an optimization algorithm.
Gradient Descent Procedure
The procedure starts off with initial values for the coefficient or coefficients for the function.
These could be 0.0 or a small random value.
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the
cost.
cost = f(coefficient)
or
cost = evaluate(f(coefficient))
The derivative of the cost is calculated. The derivative is a concept from calculus and refers to
the slope of the function at a given point. We need to know the slope so that we know the
direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that we know from the derivative which direction is downhill, we can now update the
coefficient values. A learning rate parameter (alpha) must be specified that controls how much
the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to
be good enough.
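The procedure above can be sketched as a short loop. The cost function used here, a parabola with its minimum at 3, is a hypothetical example chosen so the derivative is easy to write by hand. The tolerance and the iteration cap are also assumptions of this sketch.

```python
def cost(coefficient):
    """Hypothetical cost: a parabola whose minimum (cost 0) is at coefficient = 3."""
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    """Slope of the cost at the given coefficient."""
    return 2.0 * (coefficient - 3.0)

def gradient_descent(alpha=0.1, tolerance=1e-8, max_iterations=1000):
    coefficient = 0.0                                # initial value
    for _ in range(max_iterations):
        if cost(coefficient) < tolerance:            # close enough to zero
            break
        delta = derivative(coefficient)              # which way is downhill
        coefficient = coefficient - (alpha * delta)  # step against the slope
    return coefficient

best = gradient_descent()
print(best, cost(best))
```

A larger alpha takes bigger steps and may overshoot the minimum, while a smaller alpha needs more iterations to reach the same tolerance.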
Multilayer network
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).
In its simplest form it has 3 layers: an input layer, one hidden layer, and an output layer. If it
has more than 1 hidden layer, it is called a deep ANN.
An MLP is a typical example of a feed-forward artificial neural network.
In the figure, the i-th activation unit in the l-th layer is denoted as $a_i^{(l)}$.
The number of layers and the number of neurons are referred to as hyper parameters of a
neural network, and these need tuning. Cross-validation techniques must be used to find ideal
values for these.
The weight adjustment training is done via back propagation.
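The feed-forward computation of a small MLP can be sketched as below. The layer sizes (2 inputs, 3 hidden units, 1 output), the weight values, and the choice of sigmoid for every layer are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(activations, weights, biases):
    """One fully connected layer; weights[j][i] links unit i of the previous
    layer to unit j of this layer."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, layers):
    """Feed the input through each layer in turn (no loops back: feed-forward)."""
    a = x
    for weights, biases in layers:
        a = layer_forward(a, weights, biases)
    return a

# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.6, 0.9]], [0.05]),
]
output = mlp_forward([1.0, 2.0], layers)
print(output)
```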
Back propagation
Back propagation is a supervised learning technique for neural networks that calculates the
gradient of the error with respect to each weight, so that gradient descent can adjust the
weights. It’s short for the backward propagation of errors, since the error is computed at the
output and distributed backwards throughout the network’s layers.
When the network's output differs from the desired output, the algorithm calculates the gradient
of the error function with respect to the network’s various weights. The gradient for the final
layer of weights is calculated first, with the first layer’s gradient of weights calculated last.
Partial calculations of the gradient from one layer are reused to determine the gradient for the
previous layer. The point of this backwards method is to calculate the gradient at each layer
more efficiently than the traditional approach of calculating each layer’s gradient separately.
Weight Initialization
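The weights of a network to be trained by backprop must be initialized to some non-zero
values. If every weight started at zero, all hidden nodes in a layer would receive identical
updates and could never learn different features.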
The usual thing to do is to initialize the weights to small random values.
The reason for this is that sometimes backprop training runs become "lost" on a plateau
in weight-space, or for some other reason backprop cannot find a good minimum error
value.
Using small random values means different starting points for each training run, so that
subsequent training runs have a good chance of finding a suitable minimum.
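A minimal sketch of small random initialization. The uniform range of plus or minus 0.1 and the use of a seed to make each run reproducible are assumptions of this sketch, not values given in these notes.

```python
import random

def init_weights(n_out, n_in, scale=0.1, seed=None):
    """Return an n_out x n_in weight matrix of small random values in [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

# Different seeds give different starting points for different training runs.
w_a = init_weights(3, 2, seed=1)
w_b = init_weights(3, 2, seed=2)
print(w_a)
print(w_b)
```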
The basic algorithm can be summed up in the following equation (the delta rule) for the
change to the weight $w_{ji}$ from node i to node j:
$\Delta w_{ji} = \eta \, \delta_j \, y_i$
where $\eta$ is the learning rate, $y_i$ is the output of node i (the input signal to node j), and
the local gradient $\delta_j$ is defined as follows:
1. If node j is an output node, then $\delta_j$ is the product of $\varphi'(v_j)$ and the error signal $e_j$,
where $\varphi(\cdot)$ is the logistic function, $v_j$ is the total input to node j (i.e. $\sum_i w_{ji} y_i$), and
$e_j$ is the error signal for node j (i.e. the difference between the desired output and the
actual output);
2. If node j is a hidden node, then $\delta_j$ is the product of $\varphi'(v_j)$ and the weighted sum of the
$\delta$'s computed for the nodes in the next hidden or output layer that are connected to
node j. The formula is $\delta_j = \varphi'(v_j) \sum_k \delta_k w_{kj}$, where k ranges over those nodes for which
$w_{kj}$ is non-zero (i.e. nodes k that actually have connections from node j). The $\delta_k$ values
have already been computed, as they are in the output layer (or a layer closer to the output
layer than node j).
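The two cases above can be sketched for a network with one hidden layer. The sketch assumes each weight is changed by eta times the delta of the receiving node times the output of the sending node, where eta is a learning rate (a name chosen here). Biases are omitted for brevity, and the starting weights and the single training example are made up.

```python
import math

def phi(v):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)

def backprop_step(x, target, w_hidden, w_output, eta=0.5):
    """One delta-rule update, applied in place. Returns the squared error before
    the update. w_hidden[j][i]: input i -> hidden j; w_output[k][j]: hidden j -> output k."""
    # Forward pass: total inputs v and outputs y for each layer.
    v_hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y_hidden = [phi(v) for v in v_hidden]
    v_output = [sum(w * yj for w, yj in zip(row, y_hidden)) for row in w_output]
    y_output = [phi(v) for v in v_output]

    # Case 1, output nodes: delta = phi'(v) * (desired - actual).
    errors = [t - y for t, y in zip(target, y_output)]
    delta_output = [phi_prime(v) * e for v, e in zip(v_output, errors)]

    # Case 2, hidden nodes: delta = phi'(v) * weighted sum of the next layer's deltas.
    delta_hidden = [
        phi_prime(v_hidden[j])
        * sum(delta_output[k] * w_output[k][j] for k in range(len(w_output)))
        for j in range(len(w_hidden))
    ]

    # Weight changes: eta * delta of receiving node * output of sending node.
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += eta * delta_output[k] * y_hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hidden[j] * x[i]
    return sum(e * e for e in errors)

x, target = [1.0, 0.5], [1.0]            # hypothetical training example
w_hidden = [[0.1, -0.2], [0.3, 0.4]]     # hypothetical starting weights
w_output = [[0.2, -0.1]]
errors = [backprop_step(x, target, w_hidden, w_output) for _ in range(200)]
print(errors[0], errors[-1])
```

Note that the hidden deltas are computed before the output weights are changed, so they use the same weights that produced the forward pass.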
Auto encoders
An auto encoder neural network is an unsupervised machine learning algorithm that applies
back propagation, setting the target values to be equal to the inputs. Auto encoders are used to
reduce the size of our inputs into a smaller representation. If anyone needs the original data,
they can reconstruct an approximation of it from the compressed data.
Auto encoders are often preferred over PCA because:
An auto encoder can learn non-linear transformations with a non-linear activation
function and multiple layers.
It doesn’t have to use dense layers. It can use convolutional layers instead, which are
better suited to video, image and series data.
It can be more efficient to learn several layers with an auto encoder than to learn one
huge transformation with PCA.
An auto encoder provides a representation of each layer as the output.
It can make use of pre-trained layers from another model to apply transfer learning to
enhance the encoder/decoder.
Applications of Auto encoders
1. Image Coloring
Auto encoders are used for converting any black and white picture into a colored image.
Depending on what is in the picture, it is possible to tell what the color should be.
2. Feature variation
It extracts only the required features of an image and generates the output by removing any
noise or unnecessary interruption.
3. Dimensionality Reduction
The code layer represents the input with fewer dimensions than the original. This makes it
possible to reconstruct a similar image from a much smaller representation.
4. Denoising Image
The input seen by the auto encoder is not the raw input but a stochastically corrupted version.
A denoising auto encoder is thus trained to reconstruct the original input from the noisy
version.
5. Watermark Removal
It is also used for removing watermarks from images or to remove any object while filming a
video or a movie.
Architecture of Auto encoders: An Auto encoder consists of three components:
1. Encoder
2. Code
3. Decoder
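The three parts can be seen in a deliberately tiny sketch: an encoder that squeezes a 2-value input into a 1-value code, and a decoder that rebuilds the input from that code. It is trained with the inputs as targets. To keep it short the sketch is purely linear (no activation function), which is an assumption made here. The data, starting weights, learning rate, and epoch count are all made up.

```python
def encode(x, enc):
    """Encoder: compress the input vector into a single code value."""
    return sum(w * xi for w, xi in zip(enc, x))

def decode(code, dec):
    """Decoder: rebuild an input-sized vector from the code."""
    return [w * code for w in dec]

def mean_loss(data, enc, dec):
    """Average squared reconstruction error; the target is the input itself."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, enc), dec)
        total += sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total / len(data)

def train(data, enc, dec, lr=0.02, epochs=2000):
    """Batch gradient descent on the reconstruction error."""
    n = len(data)
    for _ in range(epochs):
        grad_enc = [0.0] * len(enc)
        grad_dec = [0.0] * len(dec)
        for x in data:
            code = encode(x, enc)
            diff = [xh - xi for xh, xi in zip(decode(code, dec), x)]
            d_code = sum(2.0 * d * w for d, w in zip(diff, dec))  # error sent back to the code
            for i in range(len(dec)):
                grad_dec[i] += 2.0 * diff[i] * code / n
                grad_enc[i] += d_code * x[i] / n
        enc = [w - lr * g for w, g in zip(enc, grad_enc)]
        dec = [w - lr * g for w, g in zip(dec, grad_dec)]
    return enc, dec

# Hypothetical data: 2-D points that all lie on one line, so one code value suffices.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
enc0, dec0 = [0.1, 0.1], [0.1, 0.1]
loss_before = mean_loss(data, enc0, dec0)
enc, dec = train(data, enc0, dec0)
loss_after = mean_loss(data, enc, dec)
print(loss_before, loss_after)
```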
Batch normalization
Batch normalization is an important feature that we can add to a model. It acts as a
regularizer, normalizes the inputs to each layer, helps during back propagation, and can be
added to most models to help them converge better.
How Does Batch Normalization work?
Batch normalization is a feature that we add between the layers of the neural network and it
continuously takes the output from the previous layer and normalizes it before sending it to the
next layer. This has the effect of stabilizing the neural network. Batch normalization is also used
to maintain the distribution of the data.
Figure 2.6: Working of Batch Normalization
The problem we have in neural networks is the internal covariate shift. While we are training
our neural network, the distribution of the inputs to each layer changes as the weights of the
preceding layers are updated, and the model trains more slowly. To keep the distribution
similar, we use batch normalization, normalizing the outputs of a layer to mean 0 and standard
deviation 1 (μ=0, σ=1). With this technique the model trains faster, and it often reaches a
higher accuracy than a model that does not use batch normalization.
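The normalization step can be sketched for a single feature across one mini-batch. The scale and shift parameters gamma and beta and the small constant eps are assumptions taken from the usual formulation of batch normalization. They are not described in these notes. With gamma=1 and beta=0 the output has mean 0 and standard deviation close to 1.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]

# Hypothetical outputs of one unit of the previous layer for a batch of 4 examples.
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
print(normalized)
```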
Dropout
Dropout is implemented per-layer in a neural network.
It can be used with most types of layers, such as dense fully connected layers, convolutional
layers, and recurrent layers such as the long short-term memory network layer.
Dropout may be implemented on any or all hidden layers in the network as well as the visible or
input layer. It is not used on the output layer.
The term “dropout” refers to dropping out units (hidden and visible) in a neural network.
Simply put, dropout means ignoring a randomly chosen set of units (i.e. neurons) during the
training phase. By “ignoring”, we mean that these units are not considered during a
particular forward or backward pass.
More technically, at each training stage, individual nodes are either dropped out of the net with
probability 1-p or kept with probability p, so that a reduced network is left; incoming and
outgoing edges to a dropped-out node are also removed.
Neural networks are the building blocks of many deep-learning architectures. They consist of
one input layer, one or more hidden layers, and an output layer.
When we train our neural network (or model) by updating each of its weights, it might
become too dependent on the dataset we are using. Therefore, when this model has to make a
prediction or classification on new data, it will not give satisfactory results. This is known as
over-fitting. We might understand this problem through a real-world example: If a student of
mathematics learns only one chapter of a book and then takes a test on the whole syllabus,
they will probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton and
his colleagues in 2012. This technique is known as dropout.
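A minimal sketch of dropout applied to the activations of one layer, keeping each unit with probability p as described above. Dividing the kept activations by p (often called inverted dropout) is an assumption of this sketch. It keeps the expected activation the same, so nothing needs rescaling at test time.

```python
import random

def dropout(activations, keep_prob, training=True, rng=random):
    """Keep each unit with probability keep_prob during training, otherwise zero it.
    Kept units are scaled by 1 / keep_prob (inverted dropout, an assumption here)."""
    if not training:
        return list(activations)  # dropout is switched off at test time
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 1000   # hypothetical activations of one hidden layer
dropped = dropout(layer_output, keep_prob=0.5, rng=rng)
print(sum(1 for a in dropped if a != 0.0), "of", len(dropped), "units kept")
```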
L1 regularization (Lasso Regression)
Lasso regression performs L1 regularization. Unlike Ridge Regression, which penalizes the sum
of the squared coefficients, it modifies the RSS by adding a penalty (shrinkage quantity) equal to
the sum of the absolute values of the coefficients.
Looking at the equation below, we can observe that similar to Ridge Regression, Lasso (Least
Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression
coefficients. In addition to this, it is quite capable of reducing the variability and improving the
accuracy of linear regression models.
$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Limitations of Lasso Regression:
If the number of predictors (p) is greater than the number of observations (n), Lasso will
pick at most n predictors as non-zero, even if all predictors are relevant (or may be used
in the test set). Lasso can therefore struggle with this type of data.
If there are two or more highly collinear variables, Lasso regression tends to select one of
them arbitrarily, which is not good for the interpretation of the data.
Lasso regression differs from ridge regression in that it uses absolute values within the
penalty function, rather than squares. Penalizing the sum of the absolute values of the
estimates (or, equivalently, constraining it) causes some of the parameter estimates to turn
out exactly zero. The larger the penalty applied, the more the estimates are shrunk towards
zero. This helps with variable selection from the given set of candidate variables.
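The difference between the two penalties is easiest to see with a single coefficient. The sketch below assumes a one-coefficient problem in which z is the unpenalized least-squares estimate, so the lasso objective reduces to (z - b)^2 + lam*|b| and the ridge objective to (z - b)^2 + lam*b^2. The closed-form minimizers follow from setting the derivative to zero. The names z, b and lam are chosen here for illustration.

```python
def lasso_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * abs(b): shrink z by lam / 2, stopping at zero."""
    shrunk = max(abs(z) - lam / 2.0, 0.0)
    return shrunk if z >= 0 else -shrunk

def ridge_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * b**2: scale z down, never exactly to zero."""
    return z / (1.0 + lam)

lam = 1.0
for z in (0.3, -0.4, 2.0):
    print(z, lasso_1d(z, lam), ridge_1d(z, lam))
```

Small estimates are set exactly to zero by the L1 penalty, while the squared penalty only scales them down. This is the variable-selection behaviour described above.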
Momentum
Momentum methods in the context of machine learning refer to a group of tricks and
techniques designed to speed up convergence of first order optimization methods like gradient
descent (and its many variants).
They essentially work by adding what’s called the momentum term to the update formula for
gradient descent, thereby ameliorating its natural “zigzagging behavior,” especially in long
narrow valleys of the cost function.
The figure shows the progress of gradient descent - with and without momentum - towards
reaching the minimum of a quadratic cost function, located at the center of the concentric
elliptical contours.
Let’s say your first update to the weights is a vector $\theta_1$. For the second update (which would
be $\theta_2$ without momentum) you update by $\theta_2 + \alpha\theta_1$. For the next one, you update
by $\theta_3 + \alpha\theta_2 + \alpha^2\theta_1$, and so on. Here the parameter $0 \le \alpha < 1$ indicates the
amount of momentum we want.
The practical way of doing that is keeping an update vector $v_i$ and updating it
as $v_{i+1} = \alpha v_i + \theta_{i+1}$.
The reason we do this is to avoid the algorithm getting stuck in a local minimum. Think of it as a
marble rolling around on a curved surface. We want to get to the lowest point. The marble
having momentum will allow it to avoid a lot of small dips and make it more likely to find a
better local solution.
Having momentum too high means you will be more likely to overshoot (the marble goes
through the local minimum but the momentum carries it back upwards for a bit). This will lead
to longer learning times. Finding the correct value of the momentum will depend on the
particular problem: the smoothness of the function, how many local minima you expect, how
“deep” the sub-optimal local minima are expected to be, etc.
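The update-vector rule above can be sketched on a long, narrow quadratic valley. The cost function, starting point, learning rate, and step count are hypothetical choices for illustration. Setting alpha to 0 recovers plain gradient descent, so the same function serves for both runs.

```python
def cost(w):
    """Hypothetical cost: a quadratic valley that is 25 times steeper in y than in x."""
    x, y = w
    return 0.5 * (x * x + 25.0 * y * y)

def gradient(w):
    x, y = w
    return [x, 25.0 * y]

def descend(start, learning_rate=0.03, alpha=0.0, steps=300):
    """Gradient descent with momentum alpha (alpha = 0 means no momentum)."""
    w = list(start)
    v = [0.0, 0.0]                                          # update vector
    for _ in range(steps):
        theta = [-learning_rate * g for g in gradient(w)]   # plain update
        v = [alpha * vi + ti for vi, ti in zip(v, theta)]   # add momentum term
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

start = [10.0, 1.0]
plain = descend(start, alpha=0.0)
with_momentum = descend(start, alpha=0.9)
print(cost(plain), cost(with_momentum))
```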
Tuning hyper parameters
Hyper parameters cannot be learned directly from the regular training process and are usually
fixed before the actual training process begins. These parameters express important properties
of the model such as its complexity or how fast it should learn.
Some examples of model hyper parameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. The learning rate for training a neural network.
3. The C and sigma hyper parameters for support vector machines.
4. The k in k-nearest neighbours.
Models can have many hyper parameters, and finding the best combination of hyper parameters
can be treated as a search problem. Two widely used strategies for hyper parameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
1. GridSearchCV
In the GridSearchCV approach, the machine learning model is evaluated for a range of hyper
parameter values. This approach is called GridSearchCV because it searches for the best set of
hyper parameters from a grid of hyper parameter values.
For example, suppose we want to set two hyper parameters, C and Alpha, of a Logistic Regression
Classifier model, each with a different set of candidate values. The grid search technique will
construct many versions of the model, one for each possible combination of hyper parameters,
and will return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the grid has
5 x 4 = 20 combinations. For the combination C=0.3 and Alpha=0.2, the performance score
comes out to be 0.726 (the highest), therefore it is selected.
Drawback: GridSearchCV will go through every combination of hyper parameters in the grid,
which makes grid search computationally very expensive.
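The exhaustive search can be sketched in plain Python. This is not the scikit-learn GridSearchCV class itself: the grid_search helper and the toy_score function are hypothetical stand-ins. A real search would replace toy_score with a cross-validated score of the model, and the toy function is simply built to peak at C=0.3, Alpha=0.2.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score, tried = None, float("-inf"), 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried

def toy_score(C, Alpha):
    """Hypothetical stand-in for a cross-validated score; highest at C=0.3, Alpha=0.2."""
    return 1.0 - (C - 0.3) ** 2 - (Alpha - 0.2) ** 2

grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5], "Alpha": [0.1, 0.2, 0.3, 0.4]}
best_params, best_score, tried = grid_search(toy_score, grid)
print(best_params, tried)
```

The number of combinations tried is the product of the list lengths, which is why the cost grows quickly as more hyper parameters or more candidate values are added.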
2. RandomizedSearchCV
RandomizedSearchCV addresses the drawback of GridSearchCV, as it goes through only a fixed
number of hyper parameter settings. It moves within the grid in random fashion to find the best
set of hyper parameters. This approach reduces unnecessary computation.
RandomizedSearchCV implements a “fit” and a “score” method. It also implements
“score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and
“inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated
search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of
parameter settings is sampled from the specified distributions. The number of parameter
settings that are tried is given by n_iter.
If all parameters are presented as a list, sampling without replacement is performed. If at least
one parameter is given as a distribution, sampling with replacement is used. It is highly
recommended to use continuous distributions for continuous parameters.
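The sampling idea can be sketched in plain Python for the case where every parameter is given as a list, so settings are drawn without replacement. This is not the scikit-learn class itself: random_search and toy_score are hypothetical stand-ins. The n_iter argument mirrors the parameter of the same name described above, and the seed is an assumption added for reproducibility.

```python
import random
from itertools import product

def random_search(score_fn, param_lists, n_iter, seed=0):
    """Score only n_iter settings, sampled without replacement from the grid."""
    rng = random.Random(seed)
    names = list(param_lists)
    candidates = list(product(*(param_lists[name] for name in names)))
    sampled = rng.sample(candidates, min(n_iter, len(candidates)))
    scored = []
    for values in sampled:
        params = dict(zip(names, values))
        scored.append((score_fn(**params), params))
    best_score, best_params = max(scored, key=lambda item: item[0])
    return best_params, best_score, len(sampled)

def toy_score(C, Alpha):
    """Hypothetical stand-in for a cross-validated score."""
    return 1.0 - (C - 0.3) ** 2 - (Alpha - 0.2) ** 2

param_lists = {"C": [0.1, 0.2, 0.3, 0.4, 0.5], "Alpha": [0.1, 0.2, 0.3, 0.4]}
best_params, best_score, tried = random_search(toy_score, param_lists, n_iter=8)
print(best_params, best_score, tried)
```

Only 8 of the 20 settings are evaluated, so the result is the best of the sampled settings and is not guaranteed to be the best in the whole grid.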