
APL 405: Machine Learning for Mechanics

Lecture 5: Linear regression

by

Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi

Instructor email: [email protected]


Introduction to basic parametric models

• We introduced the supervised machine learning problem as well as two basic non-parametric methods
• 𝑘NN and Decision Trees
• Non-parametric methods don’t have a fixed set of parameters

• Now we will look at some basic parametric modelling techniques, particularly
• Linear regression
• Logistic regression

• Parametric model
• Models that have a defined functional form and a fixed set of parameters 𝜽, which are learned from training data
• Once the parameters are learned, the training data can be discarded, and predictions depend only on 𝜽

Linear Regression
▪ In both regression and classification settings, we seek a function $f(\mathbf{x}^*)$ that maps a test input $\mathbf{x}^*$ to a prediction

▪ Regression → learn the relationship between input variables $\mathbf{x} = (x_1\; x_2\; \dots\; x_p)^T$ and a numerical output $y$
▪ The inputs can be either categorical or numerical, but for now let us assume that all $p$ inputs are numerical
▪ Mathematically, regression is about learning a model $f$ that maps the input to the output
$$y = f(\mathbf{x}) + \epsilon$$
▪ $\epsilon$ is an error term that describes everything about the input–output relationship that cannot be captured by the model
▪ From a statistical perspective, $\epsilon$ is considered a random variable, referred to as noise, that is independent of $\mathbf{x}$ and has zero mean

▪ Linear regression model: the output $y$ (a scalar) is an affine combination of the $p$ input variables $x_1, x_2, \dots, x_p$ plus a noise term
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p + \epsilon$$
▪ $\theta_0, \theta_1, \theta_2, \dots, \theta_p$ are called the parameters of the model


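▪ A minimal NumPy sketch (not from the lecture) of simulating data from such a model is shown below; the parameter values, input range, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameter values for a model with p = 2 inputs:
# y = theta_0 + theta_1*x_1 + theta_2*x_2 + eps, with eps a zero-mean noise term
theta_true = np.array([1.0, 2.0, -0.5])          # [theta_0, theta_1, theta_2]

N = 100                                          # number of simulated examples
X_inputs = rng.uniform(-1.0, 1.0, size=(N, 2))   # the p numerical inputs
eps = rng.normal(0.0, 0.1, size=N)               # zero-mean Gaussian noise

# Affine combination of the inputs plus noise
y = theta_true[0] + X_inputs @ theta_true[1:] + eps
```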
Linear Regression
▪ Linear regression model: the output $y$ (a scalar) is an affine combination of the $p+1$ variables $1, x_1, x_2, \dots, x_p$ plus a noise term
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p + \epsilon = \begin{bmatrix} 1 & x_1 & \dots & x_p \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} + \epsilon = \mathbf{x}^T \boldsymbol{\theta} + \epsilon$$
▪ $\theta_0, \theta_1, \theta_2, \dots, \theta_p$ are called the parameters of the model

▪ The symbol $\mathbf{x}$ is used for both the $(p+1)$- and the $p$-dimensional version of the input vector, with or without the constant one in the leading position, respectively

▪ The linear regression model is thus a parametric function $f(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\theta}$, so that $y = \mathbf{x}^T \boldsymbol{\theta} + \epsilon$

▪ The parameters $\boldsymbol{\theta}$ can take arbitrary values, and the actual values that we assign to them control the input–output relationship described by the model

▪ Learning the model → finding suitable values for $\boldsymbol{\theta}$ based on observed training data
How to predict on a test set?
▪ How do we make predictions $\hat{y}(\mathbf{x}^*)$ for new, previously unseen test inputs $\mathbf{x}^* = (1\; x_1^*\; x_2^*\; \dots\; x_p^*)^T$?
▪ Let $\widehat{\boldsymbol{\theta}}$ denote the learned parameter values of the linear regression model
▪ Since the noise term $\epsilon$ is random with zero mean and independent of all observed variables, we replace $\epsilon$ with 0 in the prediction

▪ The prediction takes the form
$$\hat{y}(\mathbf{x}^*) = \mathbf{x}^{*T} \widehat{\boldsymbol{\theta}}$$

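▪ A minimal sketch of this prediction step, assuming the learned parameters are stored in a NumPy array with $\theta_0$ first; the helper name `predict` and the example values are mine:

```python
import numpy as np

def predict(theta_hat, x_star):
    """Predict y_hat(x_star) = [1, x_1*, ..., x_p*]^T theta_hat.

    theta_hat : (p+1,) learned parameter vector (theta_0 first)
    x_star    : (p,) test input without the leading constant 1
    """
    x_aug = np.concatenate(([1.0], x_star))   # prepend the constant one
    return x_aug @ theta_hat                  # noise term replaced by its mean, 0

# Example with assumed values for p = 2 inputs
theta_hat = np.array([1.0, 2.0, -0.5])
print(predict(theta_hat, np.array([0.3, -0.2])))   # 1 + 2*0.3 - 0.5*(-0.2) = 1.7
```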
Training a linear regression model from training data
▪ Training data: $\mathcal{T} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$

$$\boldsymbol{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}, \qquad \mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(N)T} \end{bmatrix}, \qquad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_p^{(i)} \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$

▪ Here, 𝝐 is the vector of noise terms

▪ Predicted outputs for the training data: $\widehat{\boldsymbol{y}} = \begin{bmatrix} \hat{y}(\mathbf{x}^{(1)}) & \hat{y}(\mathbf{x}^{(2)}) & \dots & \hat{y}(\mathbf{x}^{(N)}) \end{bmatrix}^T$, i.e.
$$\widehat{\boldsymbol{y}} = \mathbf{X}\boldsymbol{\theta}$$

▪ Learning the unknown parameters $\boldsymbol{\theta}$ amounts to finding values such that $\widehat{\boldsymbol{y}}$ is "similar" to $\boldsymbol{y}$
▪ "Similar" → finding $\boldsymbol{\theta}$ such that the residual $\boldsymbol{y} - \widehat{\boldsymbol{y}} = \boldsymbol{\epsilon}$ is small

▪ We formulate a loss function, which gives a mathematical meaning to the "similarity" between $\widehat{\boldsymbol{y}}$ and $\boldsymbol{y}$

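▪ A sketch of how the training data could be stacked into $\mathbf{X}$ and $\boldsymbol{y}$ in code, assuming the raw inputs and outputs are available as arrays; the helper name `build_design_matrix` and the toy values are illustrative:

```python
import numpy as np

def build_design_matrix(inputs, outputs):
    """Stack training data into X (N x (p+1), leading column of ones) and y (N,)."""
    inputs = np.asarray(inputs, dtype=float)      # shape (N, p)
    ones = np.ones((inputs.shape[0], 1))          # constant-one feature
    X = np.hstack([ones, inputs])                 # each row is x^(i)^T
    y = np.asarray(outputs, dtype=float)          # stacked outputs
    return X, y

# Tiny assumed example with p = 2 inputs and N = 3 training points
X, y = build_design_matrix([[0.1, 0.2], [0.4, -0.3], [0.9, 0.5]], [1.4, 1.9, 2.5])
print(X.shape, y.shape)   # (3, 3) (3,)
```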
How to define the problem of learning model parameters?
▪ Use a loss function $L(y, \hat{y})$ → it measures how close the model's prediction $\hat{y}$ is to the observed data $y$
▪ The smaller the loss, the better the model fits the data, and vice versa
▪ Define the cost function $J(\boldsymbol{\theta})$ as the average loss over the training data
$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} L\!\left(y^{(i)}, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)$$

▪ Training a model → finding the model parameters $\boldsymbol{\theta}$ that minimize the average training loss
$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; J(\boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{1}{N} \sum_{i=1}^{N} L\!\left(y^{(i)}, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)$$

▪ $\hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})$ is the model prediction for the training input $\mathbf{x}^{(i)}$, and $y^{(i)}$ is the corresponding training output
▪ The parameter $\boldsymbol{\theta}$ is written as an argument to denote the dependence of the prediction on it

▪ The operator $\operatorname{argmin}_{\boldsymbol{\theta}}$ means "the value of $\boldsymbol{\theta}$ for which the average loss function attains its minimum"
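▪ A sketch of this cost function in code, with the squared error loss plugged in as one example choice of $L$; the function names and toy data are assumptions:

```python
import numpy as np

def average_loss(theta, X, y, loss):
    """Cost J(theta): average of loss(y_i, yhat_i) over the N training points."""
    y_hat = X @ theta                      # model predictions yhat(x^(i); theta)
    return np.mean([loss(yi, yhi) for yi, yhi in zip(y, y_hat)])

# Example: squared error loss L(y, yhat) = (y - yhat)^2
squared_error = lambda yi, yhi: (yi - yhi) ** 2

# Assumed small example (X already includes the leading column of ones)
X = np.array([[1.0, 0.1], [1.0, 0.4], [1.0, 0.9]])
y = np.array([1.2, 1.8, 2.9])
print(average_loss(np.array([1.0, 2.0]), X, y, squared_error))
```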
Least squares problem
▪ For regression, a commonly used loss function is the squared error loss
$$L\!\left(y, \hat{y}(\mathbf{x}; \boldsymbol{\theta})\right) = \left(y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})\right)^2$$

▪ This loss function grows quadratically as the difference $y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})$ increases

▪ The corresponding average loss function (or cost function)


$$J(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \left(y^{(i)} - \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)^2 = \frac{1}{N} \left\lVert \boldsymbol{y} - \widehat{\boldsymbol{y}} \right\rVert_2^2 = \frac{1}{N} \left\lVert \boldsymbol{y} - \mathbf{X}\boldsymbol{\theta} \right\rVert_2^2 = \frac{1}{N} \left\lVert \boldsymbol{\epsilon} \right\rVert_2^2$$

▪ Here, $\lVert \cdot \rVert_2^2$ denotes the squared Euclidean norm. Due to the square, this is called the least squares cost function

▪ In linear regression, the learning problem effectively finds the best parameter estimate
$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{1}{N} \sum_{i=1}^{N} \left(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\right)^2 = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{1}{N} \left\lVert \boldsymbol{y} - \mathbf{X}\boldsymbol{\theta} \right\rVert_2^2$$

▪ A closed-form solution exists → $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{y}$ if $\mathbf{X}^T\mathbf{X}$ is invertible (will be an exercise in HW)
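▪ A minimal sketch of the closed-form solution; instead of forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly, it solves the normal equations with `np.linalg.solve`, which assumes $\mathbf{X}^T\mathbf{X}$ is invertible. The toy data are made up for illustration:

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form least-squares estimate: solve (X^T X) theta = X^T y.

    Mathematically equivalent to theta_hat = (X^T X)^{-1} X^T y when X^T X is
    invertible, but solving the linear system avoids forming the inverse.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Assumed toy data: y is roughly 1 + 2*x with a little noise
X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
y = np.array([1.05, 1.98, 3.02, 3.97])
theta_hat = fit_least_squares(X, y)
print(theta_hat)   # approximately [1.0, 2.0]
```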


Linear regression algorithm
▪ Linear regression with squared error loss is very common in practice, due to its closed-form solution
▪ Other loss functions lead to optimization problems that often lack closed-form solutions

Training using the linear regression model

Training Data: $\mathcal{T} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})\}$
Result: Learned parameter vector $\widehat{\boldsymbol{\theta}}$

1. Construct the matrix of input features $\mathbf{X}$ and the output vector $\boldsymbol{y}$:
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(N)T} \end{bmatrix}, \qquad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_p^{(i)} \end{bmatrix}, \qquad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
2. Compute $\widehat{\boldsymbol{\theta}}$ by solving $\mathbf{X}^T\mathbf{X}\,\widehat{\boldsymbol{\theta}} = \mathbf{X}^T\boldsymbol{y}$

Testing using the linear regression model

Data: Learned parameter vector $\widehat{\boldsymbol{\theta}}$ and test input $\mathbf{x}^*$
Result: Prediction $\hat{y}(\mathbf{x}^*)$

1. Compute $\hat{y}(\mathbf{x}^*) = \mathbf{x}^{*T}\widehat{\boldsymbol{\theta}}$
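▪ An end-to-end sketch of the training and testing procedures above on assumed toy data; all variable names and values are illustrative:

```python
import numpy as np

# --- Training ---
# Assumed training data with p = 1 input: raw inputs and outputs
inputs  = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
outputs = np.array([0.9, 2.1, 2.9, 4.1, 5.0])

# 1. Construct the matrix of input features X and the output vector y
X = np.hstack([np.ones((inputs.shape[0], 1)), inputs])
y = outputs

# 2. Compute theta_hat by solving (X^T X) theta_hat = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# --- Testing ---
# 1. Compute y_hat(x_star) = x_star^T theta_hat for a new test input
x_star = np.array([1.0, 0.75])            # leading 1 followed by the raw input
y_hat_star = x_star @ theta_hat
print(theta_hat, y_hat_star)
```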
A maximum likelihood perspective of least squares
▪ "Likelihood" refers to a statistical concept: a function that describes how likely it is that a certain value of $\boldsymbol{\theta}$ has generated the measurements $\boldsymbol{y}$

▪ Instead of selecting a loss function, one could start with the problem

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta})$$

▪ $p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta})$ is the probability density of all observed outputs $\boldsymbol{y}$ in the training data, given all inputs $\mathbf{X}$ and parameters $\boldsymbol{\theta}$

▪ $p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta})$ determines mathematically what "likely" means

A maximum likelihood perspective of least squares
▪ Common assumption: the noise terms are independent and identically distributed (i.i.d.), each with a Gaussian distribution (also known as a normal distribution) with mean zero and variance $\sigma_\epsilon^2$:
$$\epsilon \sim \mathcal{N}(\epsilon; 0, \sigma_\epsilon^2)$$

▪ This implies that all observed training data points are independent, so $p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta})$ factorizes as
$$p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^{N} p\!\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$$

▪ The linear regression model, $y = \mathbf{x}^T\boldsymbol{\theta} + \epsilon$, together with the i.i.d. Gaussian noise assumption, leads to
$$p\!\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right) = \mathcal{N}\!\left(y^{(i)}; \mathbf{x}^{(i)T}\boldsymbol{\theta}, \sigma_\epsilon^2\right) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\!\left(-\frac{1}{2\sigma_\epsilon^2}\left(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\right)^2\right)$$
▪ Recall that we want to maximize the likelihood w.r.t. the parameters $\boldsymbol{\theta}$
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow/overflow when multiplying many densities
$$\ln p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\!\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$$

A maximum likelihood perspective of least squares
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow/overflow when multiplying many densities
$$\ln p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\!\left(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\right)$$

▪ Since the logarithm is a monotonically increasing function, maximising the log-likelihood is equivalent to maximising the likelihood

▪ The linear regression model, $y = \mathbf{x}^T\boldsymbol{\theta} + \epsilon$, together with the i.i.d. Gaussian noise assumption, gives
$$\ln p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta}) = -\frac{N}{2}\ln\!\left(2\pi\sigma_\epsilon^2\right) - \frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{N}\left(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\right)^2$$

$$\widehat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; \ln p(\boldsymbol{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; -\sum_{i=1}^{N}\left(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\right)^2 = \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\right)^2$$

▪ Recall that the same estimate is also obtained from linear regression with the least squares cost
▪ Using the squared error loss is therefore equivalent to assuming a Gaussian noise distribution in the maximum likelihood formulation
▪ Other assumptions on $\epsilon$ lead to other loss functions (to be discussed later)
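▪ A sketch that checks this equivalence numerically: it minimizes the negative Gaussian log-likelihood with `scipy.optimize.minimize` and compares the result with the closed-form least-squares estimate. The simulated data, the assumed known noise variance, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Assumed toy data generated from y = 1 + 2*x + eps, eps ~ N(0, 0.2^2)
x = rng.uniform(0.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.2, size=50)

sigma2 = 0.2 ** 2   # noise variance, assumed known here

def neg_log_likelihood(theta):
    """Negative Gaussian log-likelihood, -ln p(y | X; theta)."""
    resid = y - X @ theta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x   # maximum likelihood
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)                # least squares

print(theta_ml, theta_ls)   # the two estimates agree up to optimizer tolerance
```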
How to handle categorical input variables?
▪ We mentioned earlier that the input variables $\mathbf{x}$ can be numerical, categorical, or mixed

▪ Assume that an input variable is categorical and takes only two classes, say A and B

▪ We can represent such an input variable $x$ using 0 and 1:
$$x = \begin{cases} 0, & \text{if A} \\ 1, & \text{if B} \end{cases}$$

▪ For linear regression, the model then effectively looks like
$$y = \theta_0 + \theta_1 x + \epsilon = \begin{cases} \theta_0 + \epsilon, & \text{if A} \\ \theta_0 + \theta_1 + \epsilon, & \text{if B} \end{cases}$$

▪ If the input is a categorical variable with more than two classes, let’s say A, B, C, and D, use one-hot encoding

$$\mathbf{x} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \text{ if A}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \text{ if B}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \text{ if C}, \qquad \mathbf{x} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \text{ if D}$$
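▪ A small sketch of one-hot encoding such a four-class categorical input; the category list and the helper name `one_hot` are mine:

```python
import numpy as np

def one_hot(value, categories=("A", "B", "C", "D")):
    """Return the one-hot vector for a categorical value, e.g. 'C' -> [0, 0, 1, 0]."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

# The one-hot columns can then be appended to the numerical inputs before adding
# the leading constant 1 and fitting the linear regression as before.
print(one_hot("A"), one_hot("C"))   # [1. 0. 0. 0.] [0. 0. 1. 0.]
```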
