APL 405: Machine Learning for Mechanics
Lecture 5: Linear regression
by
Rajdip Nayek
Assistant Professor,
Applied Mechanics Department,
IIT Delhi
Introduction to basic parametric models
• We introduced the supervised machine learning problem as well as two basic non-parametric methods
• 𝑘NN and Decision Trees
• Non-parametric methods don’t have a fixed set of parameters
• Now we will look at some basic parametric modelling techniques, particularly
• Linear regression
• Logistic regression
• Parametric model
• Models that have a certain defined form and have a fixed set of parameters 𝜽 which are learned from
training data
• Once the parameters are learned, the training data can be discarded, and predictions depend only on 𝜽
Linear Regression
▪ In both regression and classification settings, we seek a function 𝑓(𝐱∗) that maps the test input 𝐱∗ to a prediction
▪ Regression → learn relationships between some input variables 𝐱 = [𝑥1 𝑥2 … 𝑥𝑝]ᵀ and a numerical output 𝑦
▪ The inputs can be either categorical or numerical, but let’s consider that all 𝑝 inputs are numerical
▪ Mathematically, regression is about learning a model 𝑓 that maps the input to the output
𝑦 = 𝑓(𝐱) + 𝜖
▪ 𝜖 is an error term that describes everything about the input-output relationship that cannot be captured by the model
▪ From a statistical perspective, 𝜖 is considered a random variable, referred to as noise, that is independent of 𝐱 and has zero mean
▪ Linear regression model: Output 𝑦 (a scalar) is an affine combination of 𝑝 input variables 𝑥1, 𝑥2 ,… , 𝑥𝑝 plus a noise term
𝑦 = 𝜃0 + 𝜃1 𝑥1 + 𝜃2 𝑥2 + ⋯ + 𝜃𝑝 𝑥𝑝 + 𝜖
▪ 𝜃0 , 𝜃1 , 𝜃2 , … , 𝜃𝑝 are called the parameters of the model
Linear Regression
▪ Linear regression model: Output 𝑦 (a scalar) is an affine combination of 𝑝 + 1 input variables 1, 𝑥1, 𝑥2 ,… , 𝑥𝑝 plus a noise
term
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p + \epsilon = \begin{bmatrix} 1 & x_1 & \cdots & x_p \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} + \epsilon = \mathbf{x}^T \boldsymbol{\theta} + \epsilon$$
▪ 𝜃0 , 𝜃1 , 𝜃2 , … , 𝜃𝑝 are called the parameters of the model
▪ The symbol 𝐱 is used for both the (𝑝 + 1)-dimensional and the 𝑝-dimensional version of the input vector, with or without the constant 1 in the leading position, respectively
▪ The linear regression model is thus a parametric model of the form 𝑦 = 𝑓(𝐱) + 𝜖 with 𝑓(𝐱) = 𝐱ᵀ𝜽
▪ The parameters 𝜽 can take arbitrary values, and the actual values that we assign to them will control the input–output
relationship described by the model
▪ Learning of the model → finding suitable values for 𝜽 based on observed training data
How to predict on test set?
▪ How do we make predictions ŷ(𝐱∗) for new, previously unseen test inputs 𝐱∗ = [1 𝑥∗1 𝑥∗2 … 𝑥∗𝑝]ᵀ?
▪ Let $\widehat{\boldsymbol{\theta}}$ denote the learned parameter values of the linear regression model
▪ Since the noise term 𝜖 is random with zero mean and independent of all observed variables, we replace 𝜖 with 0 in the
prediction
▪ Prediction takes form:
$$\hat{y}(\mathbf{x}_*) = \mathbf{x}_*^T \widehat{\boldsymbol{\theta}}$$
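A minimal NumPy sketch of this prediction step, using assumed (illustrative) values for the learned parameters and the test input:

```python
import numpy as np

# Assumed (hypothetical) learned parameters for p = 2 inputs: [theta_0, theta_1, theta_2]
theta_hat = np.array([0.5, 2.0, -1.0])
# Test input with the constant 1 in the leading position: [1, x*_1, x*_2]
x_star = np.array([1.0, 3.0, 4.0])

# Prediction y_hat(x*) = x*^T theta_hat; the noise term is replaced by its mean, 0
y_hat = x_star @ theta_hat
print(y_hat)   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```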
Training a linear regression model from training data
▪ Training data: $\mathcal{T} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, which in matrix form reads (a short NumPy sketch of this construction is given below)
$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\epsilon}, \qquad \mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(N)T} \end{bmatrix}, \quad \mathbf{x}^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_p^{(i)} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
▪ Here, 𝝐 is the vector of noise terms
▪ Predicted outputs for the training data: $\widehat{\mathbf{y}} = \begin{bmatrix} \hat{y}(\mathbf{x}^{(1)}) & \hat{y}(\mathbf{x}^{(2)}) & \cdots & \hat{y}(\mathbf{x}^{(N)}) \end{bmatrix}^T$, i.e.
$$\widehat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}$$
▪ Learning the unknown parameters 𝜽 amounts to finding values for them such that $\widehat{\mathbf{y}}$ is “similar” to $\mathbf{y}$
▪ “Similar” → finding 𝜽 such that the residual $\mathbf{y} - \widehat{\mathbf{y}} = \boldsymbol{\epsilon}$ is small
▪ Formulate a loss function, which gives a mathematical meaning to the “similarity” between $\widehat{\mathbf{y}}$ and $\mathbf{y}$
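As a concrete illustration, here is a minimal NumPy sketch (with assumed toy data) of how the matrix 𝐗 and the vector 𝒚 are assembled:

```python
import numpy as np

# Assumed toy training data: N = 4 observations of p = 2 numerical input variables
x_raw = np.array([[0.0, 1.0],
                  [1.0, 0.5],
                  [2.0, 1.5],
                  [3.0, 2.0]])          # shape (N, p)
y = np.array([1.1, 1.9, 3.2, 3.9])      # shape (N,)

# Each row of X is x^(i) transposed, with a leading 1 so that theta_0 acts as the intercept
N = x_raw.shape[0]
X = np.hstack([np.ones((N, 1)), x_raw])  # shape (N, p + 1)
print(X.shape, y.shape)
```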
How to define the problem of learning model parameters?
▪ Use a loss function 𝐿(𝑦, ŷ) → measures the closeness of the model’s prediction ŷ to the observed data 𝑦
▪ The smaller the loss, the better the model fits the data, and vice versa
▪ Define the average loss (or cost function) 𝐽(𝜽) as the average of the loss over the training data (a small sketch of evaluating it is given below)
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\big(y^{(i)},\, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)$$
▪ Training a model → finding the model parameters 𝜽 that minimize the average training loss
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \operatorname*{argmin}_{\boldsymbol{\theta}} \frac{1}{N}\sum_{i=1}^{N} L\big(y^{(i)},\, \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)$$
▪ $\hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})$ is the model prediction for the training input $\mathbf{x}^{(i)}$, and $y^{(i)}$ is the corresponding training output
▪ The parameter 𝜽 is written as an argument to make explicit that the prediction depends on it
▪ The operator $\operatorname*{argmin}_{\boldsymbol{\theta}}$ means ‘the value of 𝜽 for which the average loss function attains its minimum’
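A minimal sketch (assuming toy data and the squared-error loss introduced on the next slide) of how the cost 𝐽(𝜽) is evaluated for a given 𝜽:

```python
import numpy as np

def average_loss(theta, X, y, loss=lambda y_true, y_pred: (y_true - y_pred) ** 2):
    """Cost J(theta): the average of L(y_i, y_hat_i) over the N training points."""
    y_hat = X @ theta                  # y_hat(x^(i); theta) for all i at once
    return np.mean(loss(y, y_hat))

# Assumed toy data: N = 3 points, one input variable plus the leading column of ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.1, 2.9])
print(average_loss(np.array([1.0, 1.0]), X, y))   # cost for theta = [1, 1]
```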
Least squares problem
▪ For regression, a commonly used loss function is the squared error loss
$$L\big(y, \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big) = \big(y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big)^2$$
▪ This loss function grows quadratically as the difference $\big(y - \hat{y}(\mathbf{x}; \boldsymbol{\theta})\big)$ increases
▪ The corresponding average loss function (or cost function) is
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \hat{y}(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)^2 = \frac{1}{N}\,\big\|\widehat{\mathbf{y}} - \mathbf{y}\big\|_2^2 = \frac{1}{N}\,\big\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\big\|_2^2 = \frac{1}{N}\,\big\|\boldsymbol{\epsilon}\big\|_2^2$$
▪ Here, $\|\cdot\|_2^2$ denotes the squared Euclidean norm. Because of the square, this is called the least squares cost function
▪ In linear regression, the learning problem effectively finds the best parameter estimate
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2 = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\,\big\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\big\|_2^2$$
▪ A closed-form solution exists → $\widehat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, provided $\mathbf{X}^T\mathbf{X}$ is invertible (will be an exercise in HW)
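A minimal NumPy sketch of this closed-form solution on assumed toy data; `np.linalg.solve` is applied to the normal equations rather than forming the explicit inverse, and `np.linalg.lstsq` is shown as a more robust alternative when 𝐗ᵀ𝐗 is ill-conditioned:

```python
import numpy as np

# Assumed toy data with a leading column of ones in X
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

# theta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # assumes X^T X is invertible
print(theta_hat)

# Alternative that also handles a rank-deficient X (minimum-norm solution)
theta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat_lstsq)
```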
Linear regression algorithm
▪ Linear regression with squared error loss is very common in practice, due to its closed-form solution
▪ Other loss functions lead to optimization problems that often lack closed-form solutions and must be solved numerically
Training using linear regression model
Training Data: $\mathcal{T} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}$
Result: Learned parameter vector $\widehat{\boldsymbol{\theta}}$
1. Construct the matrix of input features $\mathbf{X}$ and the output vector $\mathbf{y}$ (as defined earlier)
2. Compute $\widehat{\boldsymbol{\theta}}$ by solving $\mathbf{X}^T\mathbf{X}\,\widehat{\boldsymbol{\theta}} = \mathbf{X}^T\mathbf{y}$

Testing using linear regression model
Data: Learned parameter vector $\widehat{\boldsymbol{\theta}}$ and test input $\mathbf{x}_*$
Result: Prediction $\hat{y}(\mathbf{x}_*)$
1. Compute $\hat{y}(\mathbf{x}_*) = \mathbf{x}_*^T\,\widehat{\boldsymbol{\theta}}$
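Putting the two routines together, a minimal end-to-end sketch on assumed synthetic data (generated here from y = 1 + 2x + noise, so the learned parameters should come out close to [1, 2]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed synthetic training data from y = 1 + 2*x + noise
N = 50
x_raw = rng.uniform(0, 5, size=(N, 1))
y = 1.0 + 2.0 * x_raw[:, 0] + 0.1 * rng.standard_normal(N)

# Training: build X, then solve the normal equations X^T X theta = X^T y
X = np.hstack([np.ones((N, 1)), x_raw])
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Testing: prediction for a new test input x* = 2.5
x_star = np.array([1.0, 2.5])
print(theta_hat, x_star @ theta_hat)   # roughly [1, 2] and a prediction near 6
```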
A maximum likelihood perspective of least squares
▪ “Likelihood” refers to a statistical concept: a function that describes how likely it is that a certain value of 𝜽 has generated the measurements 𝒚
▪ Instead of selecting a loss function, one could start with the problem
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\; p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$$
▪ $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ is the probability density of all observed outputs 𝒚 in the training data, given all inputs 𝐗 and parameters 𝜽
▪ $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ determines mathematically what ‘likely’ means
A maximum likelihood perspective of least squares
▪ Common assumption: Noise terms are independent and identically distributed (i.i.d.), each with a Gaussian distribution
(also known as a normal distribution) with mean zero and variance $\sigma_\epsilon^2$
$$\epsilon \sim \mathcal{N}\big(\epsilon;\, 0,\, \sigma_\epsilon^2\big)$$
▪ This implies that all observed training data points are independent, and $p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta})$ factorizes as
$$p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \prod_{i=1}^{N} p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
▪ The linear regression model, 𝑦 = 𝐱ᵀ𝜽 + 𝜖, together with the i.i.d. Gaussian noise assumption leads to
$$p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big) = \mathcal{N}\big(y^{(i)};\, \mathbf{x}^{(i)T}\boldsymbol{\theta},\, \sigma_\epsilon^2\big) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\left(-\frac{1}{2\sigma_\epsilon^2}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2\right)$$
▪ Recall, we want to maximize the likelihood w.r.t. the parameter 𝜽
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow or overflow
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
A maximum likelihood perspective of least squares
▪ It is better to work with the logarithm of the likelihood (the log-likelihood) to avoid numerical underflow or overflow
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p\big(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}\big)$$
▪ Since the logarithm is a monotonically increasing function, maximising the log-likelihood is equivalent to maximising the likelihood
▪ The linear regression model, 𝑦 = 𝐱ᵀ𝜽 + 𝜖, together with the i.i.d. Gaussian noise assumption leads to
$$\ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = -\frac{N}{2}\ln\big(2\pi\sigma_\epsilon^2\big) - \frac{1}{2\sigma_\epsilon^2}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2$$
$$\widehat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\; \ln p(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\theta}) = \operatorname*{argmax}_{\boldsymbol{\theta}}\; -\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2 = \operatorname*{argmin}_{\boldsymbol{\theta}}\; \frac{1}{N}\sum_{i=1}^{N}\big(y^{(i)} - \mathbf{x}^{(i)T}\boldsymbol{\theta}\big)^2$$
▪ Recall that the same estimate is also obtained from linear regression with the least squares cost
▪ Using the squared error loss is equivalent to assuming a Gaussian noise distribution in the maximum likelihood formulation (illustrated numerically in the sketch below)
▪ Other assumptions on 𝜖 lead to other loss functions (will discuss later)
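A minimal sketch (on assumed synthetic data, with the noise standard deviation treated as known) illustrating this equivalence numerically: the maximizer of the Gaussian log-likelihood, found with a generic optimizer, matches the closed-form least-squares estimate:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)

# Assumed synthetic data: y = X theta_true + Gaussian noise
N = 100
X = np.hstack([np.ones((N, 1)), rng.uniform(0, 5, size=(N, 1))])
theta_true = np.array([1.0, 2.0])
sigma_eps = 0.3
y = X @ theta_true + sigma_eps * rng.standard_normal(N)

def neg_log_likelihood(theta):
    # -ln p(y | X; theta) under the i.i.d. Gaussian noise assumption
    return -np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma_eps))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_mle, theta_ls)   # the two estimates agree up to optimizer tolerance
```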
How to handle categorical input variables?
▪ We mentioned earlier that the input variables 𝐱 can be numerical, categorical, or mixed
▪ Assume that an input variable is categorical and takes only two classes, say A and B
▪ We can represent such an input variable 𝑥 using 0 and 1:
$$x = \begin{cases} 0, & \text{if A} \\ 1, & \text{if B} \end{cases}$$
▪ For linear regression, the model effectively looks like
$$y = \theta_0 + \theta_1 x + \epsilon = \begin{cases} \theta_0 + \epsilon, & \text{if A} \\ \theta_0 + \theta_1 + \epsilon, & \text{if B} \end{cases}$$
▪ If the input is a categorical variable with more than two classes, let’s say A, B, C, and D, use one-hot encoding
$$\mathbf{x} = \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \text{ if A}, \qquad \mathbf{x} = \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \text{ if B}, \qquad \mathbf{x} = \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \text{ if C}, \qquad \mathbf{x} = \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \text{ if D}$$
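A minimal sketch (with assumed toy values) of building one-hot encoded inputs for a four-class categorical variable:

```python
import numpy as np

# Assumed toy categorical observations and the ordered list of classes
categories = ["A", "B", "C", "D"]
raw = ["B", "A", "D", "B", "C"]

# One-hot encode: each row has a single 1 in the column of its class
index = {c: i for i, c in enumerate(categories)}
X_onehot = np.zeros((len(raw), len(categories)))
for row, value in enumerate(raw):
    X_onehot[row, index[value]] = 1.0

print(X_onehot)
# These columns can be appended to any numerical inputs before adding the
# leading column of ones and fitting the linear regression as before.
```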