Lecture 10 Linear_Regression

The document provides an overview of supervised machine learning, specifically focusing on linear regression and the gradient descent optimization algorithm. It explains the concepts of simple linear regression, cost functions, and the mechanics of gradient descent, including its types and how to determine the best learning rates. Additionally, it discusses the iterative process of minimizing the cost function to achieve optimal model parameters.


Supervised Machine Learning

Linear Regression
➢ Introduction
➢ Simple Linear Regression (1 Variable)
➢ Cost Function (MSE)
➢ Gradient Descent Optimization
Regression
➢ Linear Regression is a supervised learning technique
used to model the relationship between a dependent
variable (target) and one or more independent variables
(features).
➢ The goal is to predict the value of the dependent
variable based on the values of the independent
variables.
Linear Regression

Workflow: the Training Set is fed to a Learning Algorithm, which outputs a hypothesis h; given the feature(s), h produces a prediction.
Linear Regression with one variable
Linear regression aims to fit a linear equation to observed
data.
The simplest form, known as simple linear regression,
models the relationship between two variables (1 Feature
and the target) by fitting a linear equation to observed data.
Linear Regression with one variable
Given a training set of housing prices:

Size in feet² (x)    Price (y) in 1000s of dollars
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Linear Regression with one variable
The same training set (Size in feet² (x) vs. Price (y)) is shown next to a scatter plot of Housing Prices: Price (in 1000s of dollars) on the vertical axis against Size (feet²) on the horizontal axis, one point per training example.
Workflow for housing prices: the Training Set is fed to a Learning Algorithm, which outputs a hypothesis h; given the size of a house, h outputs an estimated price.
Linear Regression with one variable
m = Number of training examples
x’s = “input” variable / feature
y’s = “output” variable / “target” variable
(x,y) => one training example
(x(i), y(i)) => ith training example

Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: Parameters
How to choose θ0, θ1?
Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost Function: J(θ0, θ1) = 1/(2m) Σi=1..m (hθ(x(i)) − y(i))²
(Squared error function, or mean squared error)

Goal: minimize J(θ0, θ1) over θ0, θ1
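As a minimal sketch (not part of the original slides), the hypothesis and cost function above can be written in Python as follows; the function and variable names are illustrative, and the data come from the housing-prices table shown earlier.

```python
import numpy as np

def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x(i)) - y(i))**2)."""
    m = len(x)
    errors = predict(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Training set from the housing-prices table above
x = np.array([2104.0, 1416.0, 1534.0, 852.0])  # size in feet^2
y = np.array([460.0, 232.0, 315.0, 178.0])     # price in $1000s

print(compute_cost(0.0, 0.0, x, y))  # cost of the all-zero hypothesis
```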
Simplified
Hypothesis: hθ(x) = θ1x    (θ0 set to 0)

Parameter: θ1

Cost Function (Squared error function): J(θ1) = 1/(2m) Σi=1..m (hθ(x(i)) − y(i))²

Goal: minimize J(θ1) over θ1
[Plots: for a fixed choice of the parameters, hθ(x) is a function of x (left), while J is a function of the parameter θ1 (right). For example, θ0 = 0, θ1 = 1 fits the simplified data exactly, giving J(0, 1) = 0; other values of θ1 give larger costs (about 0.58 for one of the values shown). Plotting J(θ0, θ1) over both parameters gives a bowl-shaped convex function, usually visualized as a 3-D surface or a contour plot.]
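The J(θ1) values traced by these plots can be reproduced with a short Python sketch. The tiny data set {(1,1), (2,2), (3,3)} below is an assumption, chosen only to be consistent with the costs shown on the slides (J = 0 at θ1 = 1 and roughly 0.58 elsewhere).

```python
import numpy as np

# Illustrative data set (an assumption, not stated in the extracted slides)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost_simplified(theta1):
    """J(theta1) = 1/(2m) * sum((theta1 * x(i) - y(i))**2), with theta0 fixed at 0."""
    return np.sum((theta1 * x - y) ** 2) / (2 * len(x))

for theta1 in [0.0, 0.5, 1.0, 1.5]:
    print(f"theta1={theta1}: J={cost_simplified(theta1):.2f}")
# theta1 = 1 fits the data exactly (J = 0); other values trace out the bowl.
```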
Workflow: the Training Set is fed to a Learning Algorithm (gradient descent), which outputs a hypothesis h; given a feature value, h produces a prediction.
Gradient (or steepest) descent
➢ It is an optimization algorithm used to minimize
the cost function in machine learning and deep
learning.
➢ It is a crucial part of training models.
Gradient Descent
Objective (Cost) Function:
● The function that you want to minimize. In machine learning,
this is typically the loss function, which measures the difference
between the model's predictions and the actual values.
Parameters:

● The variables in the model that are adjusted to minimize the cost
function. In a linear regression model, for example, these would be
the coefficients of the line.
Gradient Descent
Gradient:
● The gradient is the vector of partial derivatives of the cost
function with respect to each parameter.

Learning Rate α:
● A hyperparameter that controls how much the
parameters are adjusted with respect to the gradient
during each update.
Gradient Descent
➢ Gradient descent works by moving downward toward the pits or valleys in the graph to find the minimum value.
➢ This is achieved by taking the derivative of the cost function.
➢ During each iteration, gradient descent steps down the cost function in the direction of steepest descent.
➢ By adjusting the parameters in this direction, it seeks to reach the minimum of the cost function and find the best-fit values for the parameters.
➢ The size of each step is determined by a parameter α known as the learning rate; a small one-parameter sketch of this iteration follows below.
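To make the update rule concrete, here is a minimal one-parameter sketch in Python (not from the slides); the example function f(θ) = (θ − 3)² and all names are illustrative.

```python
def gradient_descent_1d(df, theta_init, alpha, num_iters):
    """Repeatedly step against the derivative: theta := theta - alpha * df(theta)."""
    theta = theta_init
    for _ in range(num_iters):
        theta -= alpha * df(theta)
    return theta

# Minimize f(theta) = (theta - 3)**2, whose derivative is 2 * (theta - 3).
theta_min = gradient_descent_1d(lambda t: 2 * (t - 3), theta_init=0.0,
                                alpha=0.1, num_iters=100)
print(theta_min)  # approaches 3.0, the minimizer of f
```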
Gradient Descent
[Surface plots of J(θ0, θ1) against θ0 and θ1: for linear regression the cost surface is a convex bowl with a single global minimum, while a more general cost surface can contain several valleys, so gradient descent started from different points may settle in different local optima. At a local optimum the derivative is zero, so the update leaves the current value of the parameter unchanged.]
Gradient descent
Have some function J(θ0, θ1)
Want: min J(θ0, θ1) over θ0, θ1
Outline:
• Start with some θ0, θ1 (say θ0 = 0, θ1 = 0)
• Keep changing θ0, θ1 to reduce J(θ0, θ1)
until we hopefully end up at a minimum
Gradient descent algorithm

repeat until convergence {
    θj := θj − α · ∂J(θ0, θ1)/∂θj        (for j = 0 and j = 1)
}

α is the learning rate.

Correct (simultaneous update):
temp0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
temp1 := θ1 − α · ∂J(θ0, θ1)/∂θ1
θ0 := temp0
θ1 := temp1

Incorrect (non-simultaneous update):
temp0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
θ0 := temp0
temp1 := θ1 − α · ∂J(θ0, θ1)/∂θ1        (this now uses the already-updated θ0)
θ1 := temp1
Linear regression with one variable
Applying the gradient descent algorithm to the linear regression model (update θ0 and θ1 simultaneously):

∂J(θ0, θ1)/∂θj = ∂/∂θj [ 1/(2m) Σi=1..m (hθ(x(i)) − y(i))² ]

For j = 0:  ∂J(θ0, θ1)/∂θ0 = (1/m) Σi=1..m (hθ(x(i)) − y(i))
For j = 1:  ∂J(θ0, θ1)/∂θ1 = (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x(i)

repeat until convergence {
    θ0 := θ0 − α · (1/m) Σi=1..m (hθ(x(i)) − y(i))
    θ1 := θ1 − α · (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x(i)
}
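A minimal Python sketch (not from the slides) of batch gradient descent for one-variable linear regression using the update rules above; function and parameter names are illustrative.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, num_iters=100):
    """Batch gradient descent for h(x) = theta0 + theta1 * x with the MSE cost."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = theta0 + theta1 * x - y      # h(x(i)) - y(i), for all i at once
        grad0 = errors.sum() / m              # dJ/dtheta0
        grad1 = (errors * x).sum() / m        # dJ/dtheta1
        # simultaneous update: both new values use the old theta0, theta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1
```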
To simplify, assume θ0 = 0, so only θ1 is updated. The derivative dJ(θ1)/dθ1 is the slope of the cost curve at the current θ1:

● If the slope is +ve: θ1 := θ1 − (+ve value). Hence the value of θ1 decreases.
● If the slope is -ve: θ1 := θ1 − (-ve value). Hence the value of θ1 increases.

In both cases θ1 moves toward the minimum of J(θ1).
Learning Rate

● If α is too small, gradient descent takes very small steps and is slow to converge.
● If α is too large, gradient descent can overshoot the minimum and fail to converge (it may even diverge).
Gradient descent can converge to a local
minimum, even with the learning rate α fixed.

As we approach a local minimum, gradient descent


will automatically take smaller steps. So, no need to
decrease α over time.
How does gradient descent converge with a fixed step size 𝛼 ?
The intuition behind the convergence is that the derivative of the cost function approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative is exactly 0, and thus we get θj := θj − α · 0 = θj, so the parameters stop changing.
How to find the best learning rates
➢ There is no formula for finding the right learning rate. You have to try several values before you find the right one. This process is called hyperparameter tuning.
How to find the best learning rates
➢ One strategy is to run the gradient descent algorithm with several values of the learning rate and, for each value, plot the cost function against the number of iterations.
➢ For example, try …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
➢ Each time, plot the learning curve (cost function vs. number of iterations) and choose the best value; a small sketch of such a sweep follows below.
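A small Python sketch (not from the slides) of this learning-rate sweep on the five-point example data set used later in the lecture; with a rate that is too large for this data (here 0.3), the cost grows instead of shrinking.

```python
import numpy as np

def run_and_record(x, y, alpha, num_iters=50):
    """Run batch gradient descent and record the cost after every iteration."""
    m = len(x)
    theta0, theta1, history = 0.0, 0.0, []
    for _ in range(num_iters):
        errors = theta0 + theta1 * x - y
        theta0, theta1 = (theta0 - alpha * errors.sum() / m,
                          theta1 - alpha * (errors * x).sum() / m)
        history.append(np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m))
    return history

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    history = run_and_record(x, y, alpha)
    print(f"alpha={alpha}: final cost {history[-1]:.4f}")  # plot history to see the curve
```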
The number of Iterations
➢ The cost function should decrease after each iteration if gradient descent is working properly.
➢ Gradient descent has converged when it no longer reduces the cost function and the cost stays at the same level.
➢ The number of iterations required for gradient descent to
converge varies considerably. Sometimes it takes fifty
iterations, and other times it can be as many as two or three
million.
➢ It is difficult to estimate the number of iterations in
advance.
Making sure gradient descent is working correctly.

Example automatic convergence test:

Declare convergence if J(θ) decreases by less than some small threshold ε (for example, 10⁻³) in one iteration.
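A Python sketch (not from the slides) of this automatic convergence test wrapped around batch gradient descent; ε and the other defaults are illustrative.

```python
import numpy as np

def gradient_descent_until_converged(x, y, alpha=0.1, epsilon=1e-3, max_iters=100000):
    """Iterate until the cost decreases by less than epsilon in one iteration."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    prev_cost = np.inf
    for it in range(1, max_iters + 1):
        errors = theta0 + theta1 * x - y
        theta0, theta1 = (theta0 - alpha * errors.sum() / m,
                          theta1 - alpha * (errors * x).sum() / m)
        cost = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)
        if prev_cost - cost < epsilon:          # automatic convergence test
            return theta0, theta1, it
        prev_cost = cost
    return theta0, theta1, max_iters
```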
[Sequence of plots: as gradient descent runs, the fitted line hθ(x) (for fixed θ0, θ1, a function of x) and the corresponding point on the contour plot of J(θ0, θ1) (a function of the parameters) move step by step toward the minimum.]
Gradient Descent types
1-Batch Gradient Descent: Each step of gradient descent
uses all the training examples (m).
2-Stochastic Gradient Descent (SGD): the gradient is calculated using just a small random part of the training examples (often a single example) instead of all of them. In some cases, this approach can reduce computation time; a sketch follows below.
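A minimal Python sketch (not from the slides) of stochastic gradient descent for the one-variable model, updating the parameters from one randomly chosen example at a time; names and defaults are illustrative.

```python
import numpy as np

def sgd(x, y, alpha=0.01, num_epochs=50, seed=0):
    """Stochastic gradient descent for h(x) = theta0 + theta1 * x."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_epochs):
        for i in rng.permutation(len(x)):       # visit the examples in random order
            error = theta0 + theta1 * x[i] - y[i]
            theta0 -= alpha * error             # gradient from this single example
            theta1 -= alpha * error * x[i]
    return theta0, theta1
```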
Example

X Y
1 1
2 3
4 3
3 2
5 5
Example
Assume Θ0 = 0, Θ1 = 0, and α = 0.1.

X   Y
1   1
2   3
4   3
3   2
5   5

Iteration 1: h(x) = 0
Example
Assume Θ0 = 0, Θ1 = 0, and α = 0.1.

Iteration 1: h(x) = 0
J(0,0) = 1/(2×5) [(0−1)² + (0−3)² + (0−3)² + (0−2)² + (0−5)²] = 4.8
Θ0 = 0 − 0.1/5 × [−1 − 3 − 3 − 2 − 5] = 0.28
Θ1 = 0 − 0.1/5 × [(−1×1) + (−3×2) + (−3×4) + (−2×3) + (−5×5)] = 1
h = 0.28 + x
J(0.28, 1) = 1/(2×5) [(1.28−1)² + (2.28−3)² + (4.28−3)² + (3.28−2)² + (5.28−5)²] = 0.3952 ≈ 0.4

Iteration 2:
Θ0 = 0.2, Θ1 = 0.816 → J(0.2, 0.816) ≈ 0.25 (h = 0.2 + 0.816x)

Iteration 3:
Θ0 = 0.185, Θ1 = 0.776 → (h = 0.185 + 0.776x)
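A short Python sketch (not part of the slides) that reproduces this worked example; the first iteration matches the numbers above exactly, while later iterations may differ slightly from the slide's rounded values.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 2.0, 5.0])
alpha, m = 0.1, len(x)
theta0, theta1 = 0.0, 0.0

for it in range(1, 4):
    errors = theta0 + theta1 * x - y                      # h(x(i)) - y(i)
    theta0, theta1 = (theta0 - alpha * errors.sum() / m,  # simultaneous update
                      theta1 - alpha * (errors * x).sum() / m)
    cost = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)
    print(f"Iteration {it}: theta0={theta0:.3f}, theta1={theta1:.3f}, J={cost:.4f}")
# Iteration 1 gives theta0 = 0.28, theta1 = 1.0 and J(0.28, 1) ≈ 0.3952, as above.
```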
