APL 405: Machine Learning for Mechanics
Lecture 5: Linear regression
by
Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi
Introduction to basic parametric models
• We introduced the supervised machine learning problem as well as two basic non-parametric methods
• 𝑘NN and Decision Trees
• Non-parametric methods don’t have a fixed set of parameters
• Now we will look at some basic parametric modelling techniques, particularly
• Linear regression
• Logistic regression
• Parametric model
• Models that have a certain defined form and have a fixed set of parameters 𝜽 which are learned from
training data
• Once the parameters are learned, the training data can be discarded, and predictions depend only on 𝜽
Linear Regression
▪ In both regression and classification settings, we seek a function 𝑓(𝐱∗) that maps the test input 𝐱∗ to a prediction
▪ Regression → learn relationships between some input variables 𝐱 = [𝑥1 𝑥2 … 𝑥𝑝]ᵀ and a numerical output 𝑦
▪ The inputs can be either categorical or numerical, but let’s consider that all 𝑝 inputs are numerical
▪ Mathematically, regression is about learning a model 𝑓 that maps the input to the output
𝑦 = 𝑓(𝐱) + 𝜖
▪ 𝜖 is an error term that describes everything about the input-output relationship that cannot be captured by the model
▪ From a statistical perspective, 𝜖 is considered a random variable, referred to as noise, that is independent of 𝐱 and has zero mean
▪ Linear regression model: Output 𝑦 (a scalar) is an affine combination of 𝑝 input variables 𝑥1, 𝑥2 ,… , 𝑥𝑝 plus a noise term
𝑦 = 𝜃0 + 𝜃1 𝑥1 + 𝜃2 𝑥2 + ⋯ + 𝜃𝑝 𝑥𝑝 + 𝜖
▪ 𝜃0 , 𝜃1 , 𝜃2 , … , 𝜃𝑝 are called the parameters of the model
Linear Regression
▪ Linear regression model: Output 𝑦 (a scalar) is an affine combination of 𝑝 + 1 input variables 1, 𝑥1, 𝑥2 ,… , 𝑥𝑝 plus a noise
term
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \epsilon = \begin{bmatrix} 1 & x_1 & \cdots & x_p \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} + \epsilon = \mathbf{x}^T \boldsymbol{\theta} + \epsilon$$
▪ 𝜃0 , 𝜃1 , 𝜃2 , … , 𝜃𝑝 are called the parameters of the model
▪ The symbol 𝐱 is used for both the (𝑝 + 1)-dimensional and the 𝑝-dimensional version of the input vector, with or without the constant 1 in the leading position, respectively
▪ The linear regression model is thus a parametric model of the form 𝑦 = 𝑓(𝐱) + 𝜖 with 𝑓(𝐱) = 𝐱ᵀ𝜽
▪ The parameters 𝜽 can take arbitrary values, and the actual values that we assign to them will control the input–output
relationship described by the model
▪ Learning of the model → finding suitable values for 𝜽 based on observed training data
How to predict on test set?
▪ How do we make predictions ŷ(𝐱∗) for new, previously unseen test inputs 𝐱∗ = [1 𝑥∗1 𝑥∗2 … 𝑥∗𝑝]ᵀ?
▪ Let $\widehat{\boldsymbol{\theta}}$ denote the learned parameter values of the linear regression model
▪ Since the noise term 𝜖 is random with zero mean and independent of all observed variables, we replace 𝜖 with 0 in the
prediction
▪ Prediction takes form:
$$\hat{y}(\mathbf{x}_*) = \mathbf{x}_*^T \widehat{\boldsymbol{\theta}}$$
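A minimal NumPy sketch of this prediction step, using assumed (illustrative) values for the learned parameters and the test input:

```python
import numpy as np

# Assumed (hypothetical) learned parameters for p = 2 inputs: [theta_0, theta_1, theta_2]
theta_hat = np.array([0.5, 2.0, -1.0])
# Test input with the constant 1 in the leading position: [1, x*_1, x*_2]
x_star = np.array([1.0, 3.0, 4.0])

# Prediction y_hat(x*) = x*^T theta_hat; the noise term is replaced by its mean, 0
y_hat = x_star @ theta_hat
print(y_hat)   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```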
Training a linear regression model from training data
▪ Training data: $\mathcal{T} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, which in matrix form reads (a short NumPy sketch of this construction is given below)
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}, \qquad \mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(N)T} \end{bmatrix}, \quad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_p^{(i)} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
▪ Here, 𝝐 is the vector of noise terms
▪ Predicted outputs for the training data: $\widehat{\mathbf{y}} = \begin{bmatrix} \hat{y}(\mathbf{x}^{(1)}) & \hat{y}(\mathbf{x}^{(2)}) & \cdots & \hat{y}(\mathbf{x}^{(N)}) \end{bmatrix}^T$, i.e.
$$\widehat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$$
▪ Learning the unknown parameters 𝜽 amounts to finding values for them such that $\widehat{\mathbf{y}}$ is “similar” to $\mathbf{y}$
▪ “Similar” → finding 𝜽 such that the residual $\mathbf{y} - \widehat{\mathbf{y}} = \boldsymbol{\epsilon}$ is small
▪ Formulate a loss function, which gives a mathematical meaning to the “similarity” between $\widehat{\mathbf{y}}$ and $\mathbf{y}$
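As a concrete illustration, here is a minimal NumPy sketch (with assumed toy data) of how the matrix 𝐗 and the vector 𝒚 are assembled:

```python
import numpy as np

# Assumed toy training data: N = 4 observations of p = 2 numerical input variables
x_raw = np.array([[0.0, 1.0],
                  [1.0, 0.5],
                  [2.0, 1.5],
                  [3.0, 2.0]])          # shape (N, p)
y = np.array([1.1, 1.9, 3.2, 3.9])      # shape (N,)

# Each row of X is x^(i) transposed, with a leading 1 so that theta_0 acts as the intercept
N = x_raw.shape[0]
X = np.hstack([np.ones((N, 1)), x_raw])  # shape (N, p + 1)
print(X.shape, y.shape)
```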
How to define the problem of learning model parameters?
▪ Use a loss function 𝐿(𝑦, ŷ) → measures the closeness of the model’s prediction ŷ to the observed data 𝑦
▪ The smaller the loss, the better the model fits the data, and vice versa
▪ Define the average loss (or cost function) 𝐽(𝜽) as the average of the loss over the training data (a small sketch of evaluating it is given below)
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\big(y^{(i)},\, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)$$
▪ Training a model → finding the model parameters 𝜽 that minimize the average training loss
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \operatorname*{argmin}_{\boldsymbol{\theta}} \frac{1}{N}\sum_{i=1}^{N} L\big(y^{(i)},\, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)$$
▪ $\hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})$ is the model prediction for the training input $\mathbf{x}^{(i)}$, and $y^{(i)}$ is the corresponding training output
▪ The parameter 𝜽 is written as an argument to make explicit that the prediction depends on it
▪ The operator $\operatorname*{argmin}_{\boldsymbol{\theta}}$ means ‘the value of 𝜽 for which the average loss function attains its minimum’
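A minimal sketch (assuming toy data and the squared-error loss introduced on the next slide) of how the cost 𝐽(𝜽) is evaluated for a given 𝜽:

```python
import numpy as np

def average_loss(theta, X, y, loss=lambda y_true, y_pred: (y_true - y_pred) ** 2):
    """Cost J(theta): the average of L(y_i, y_hat_i) over the N training points."""
    y_hat = X @ theta                  # y_hat(x^(i); theta) for all i at once
    return np.mean(loss(y, y_hat))

# Assumed toy data: N = 3 points, one input variable plus the leading column of ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.1, 2.9])
print(average_loss(np.array([1.0, 1.0]), X, y))   # cost for theta = [1, 1]
```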
Least squares problem
▪ For regression, a commonly used loss function is the squared error loss
$$L\big(y, \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big) = \big(y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big)^2$$
▪ This loss function grows quadratically as the difference $\big(y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big)$ increases
▪ The corresponding average loss function (or cost function) is
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)^2 = \frac{1}{N}\,\big\|\widehat{\mathbf{y}} - \mathbf{y}\big\|_2^2 = \frac{1}{N}\,\big\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\big\|_2^2 = \frac{1}{N}\,\big\|\boldsymbol{\epsilon}\big\|_2^2$$
▪ Here, $\|\cdot\|_2^2$ denotes the squared Euclidean norm. Because of the square, this is called the least squares cost function
▪ In linear regression, the learning problem effectively finds the best parameter estimate
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2 = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\,\big\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\big\|_2^2$$
▪ A closed-form solution exists → $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, provided $\mathbf{X}^T\mathbf{X}$ is invertible (will be an exercise in HW)
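A minimal NumPy sketch of this closed-form solution on assumed toy data; `np.linalg.solve` is applied to the normal equations rather than forming the explicit inverse, and `np.linalg.lstsq` is shown as a more robust alternative when 𝐗ᵀ𝐗 is ill-conditioned:

```python
import numpy as np

# Assumed toy data with a leading column of ones in X
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

# theta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # assumes X^T X is invertible
print(theta_hat)

# Alternative that also handles a rank-deficient X (minimum-norm solution)
theta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat_lstsq)
```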
Linear regression algorithm
▪ Linear regression with squared error loss is very common in practice, due to its closed-form solution
▪ Other loss functions lead to optimization problems that often lack closed-form solutions and must be solved numerically
Training using linear regression model
Training Data: $\mathcal{T} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$
Result: Learned parameter vector $\widehat{\boldsymbol{\theta}}$
1. Construct the matrix of input features $\mathbf{X}$ and the output vector $\mathbf{y}$ (as defined earlier)
2. Compute $\widehat{\boldsymbol{\theta}}$ by solving $\mathbf{X}^T\mathbf{X}\,\widehat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$

Testing using linear regression model
Data: Learned parameter vector $\widehat{\boldsymbol{\theta}}$ and test input $\mathbf{x}_*$
Result: Prediction $\hat{y}(\mathbf{x}_*)$
1. Compute $\hat{y}(\mathbf{x}_*) = \mathbf{x}_*^T\,\widehat{\boldsymbol{\theta}}$
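Putting the two routines together, a minimal end-to-end sketch on assumed synthetic data (generated here from y = 1 + 2x + noise, so the learned parameters should come out close to [1, 2]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic training data from y = 1 + 2*x + noise
N = 50
x_raw = rng.uniform(0, 5, size=(N, 1))
y = 1.0 + 2.0 * x_raw[:, 0] + 0.1 * rng.standard_normal(N)

# Training: build X, then solve the normal equations X^T X theta = X^T y
X = np.hstack([np.ones((N, 1)), x_raw])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Testing: prediction for a new test input x* = 2.5
x_star = np.array([1.0, 2.5])
print(theta_hat, x_star @ theta_hat)   # roughly [1, 2] and a prediction near 6
```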
A maximum likelihood perspective of least squares
▪ “Likelihood” refers to a statistical concept: a function that describes how likely it is that a certain value of 𝜽 has generated the measurements 𝒚
▪ Instead of selecting a loss function, one could start with the problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$$
▪ $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ is the probability density of all observed outputs 𝒚 in the training data, given all inputs 𝐗 and parameters 𝜽
▪ $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ determines mathematically what ‘likely’ means
A maximum likelihood perspective of least squares
▪ Common assumption: Noise terms are independent and identically distributed (i.i.d.), each with a Gaussian distribution
(also known as a normal distribution) with mean zero and variance $\sigma_\epsilon^2$
$$\epsilon \sim \mathcal{N}\big(\epsilon;\, 0,\, \sigma_\epsilon^2\big)$$
▪ This implies that all observed training data points are independent, and $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ factorizes as
$$p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^{N} p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
▪ The linear regression model, 𝑦 = 𝐱ᵀ𝜽 + 𝜖, together with the i.i.d. Gaussian noise assumption leads to
$$p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big) = \mathcal{N}\big(y^{(i)};\, \mathbf{x}^{(i)T}\boldsymbol{\theta},\, \sigma_\epsilon^2\big) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\left(-\frac{1}{2\sigma_\epsilon^2}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2\right)$$
▪ Recall, we want to maximize the likelihood w.r.t. the parameter 𝜽
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow or overflow
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
A maximum likelihood perspective of least squares
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow or overflow
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
▪ Since the logarithm is a monotonically increasing function, maximising the log-likelihood is equivalent to maximising the likelihood
▪ The linear regression model, 𝑦 = 𝐱ᵀ𝜽 + 𝜖, together with the i.i.d. Gaussian noise assumption leads to
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = -\frac{N}{2}\ln\big(2\pi\sigma_\epsilon^2\big) - \frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2$$
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\; \ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \operatorname*{argmax}_{\boldsymbol{\theta}}\; -\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2 = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2$$
▪ Recall that the same estimate is also obtained from linear regression with the least squares cost
▪ Using the squared error loss is equivalent to assuming a Gaussian noise distribution in the maximum likelihood formulation (illustrated numerically in the sketch below)
▪ Other assumptions on 𝜖 lead to other loss functions (will discuss later)
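A minimal sketch (on assumed synthetic data, with the noise standard deviation treated as known) illustrating this equivalence numerically: the maximizer of the Gaussian log-likelihood, found with a generic optimizer, matches the closed-form least-squares estimate:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed synthetic data: y = X theta_true + Gaussian noise
N = 100
X = np.hstack([np.ones((N, 1)), rng.uniform(0, 5, size=(N, 1))])
theta_true = np.array([1.0, 2.0])
sigma_eps = 0.3
y = X @ theta_true + sigma_eps * rng.standard_normal(N)

def neg_log_likelihood(theta):
    # -ln p(y | X; theta) under the i.i.d. Gaussian noise assumption
    return -np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma_eps))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_mle, theta_ls)   # the two estimates agree up to optimizer tolerance
```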
How to handle categorical input variables?
▪ We mentioned earlier that the input variables 𝐱 can be numerical, categorical, or mixed
▪ Assume that an input variable is categorical and takes only two classes, say A and B
▪ We can represent such an input variable 𝑥 using 0 and 1:
$$x = \begin{cases} 0, & \text{if A} \\ 1, & \text{if B} \end{cases}$$
▪ For linear regression, the model effectively looks like
$$y = \theta_0 + \theta_1 x + \epsilon = \begin{cases} \theta_0 + \epsilon, & \text{if A} \\ \theta_0 + \theta_1 + \epsilon, & \text{if B} \end{cases}$$
▪ If the input is a categorical variable with more than two classes, let’s say A, B, C, and D, use one-hot encoding
$$\mathbf{x} = \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \text{ if A}, \qquad \mathbf{x} = \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \text{ if B}, \qquad \mathbf{x} = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \text{ if C}, \qquad \mathbf{x} = \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \text{ if D}$$
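A minimal sketch (with assumed toy values) of building one-hot encoded inputs for a four-class categorical variable:

```python
import numpy as np

# Assumed toy categorical observations and the ordered list of classes
categories = ["A", "B", "C", "D"]
raw = ["B", "A", "D", "B", "C"]

# One-hot encode: each row has a single 1 in the column of its class
index = {c: i for i, c in enumerate(categories)}
X_onehot = np.zeros((len(raw), len(categories)))
for row, value in enumerate(raw):
    X_onehot[row, index[value]] = 1.0

print(X_onehot)
# These columns can be appended to any numerical inputs before adding the
# leading column of ones and fitting the linear regression as before.
```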