Big Data and Machine Learning

Lecture Slides 2: Linear Regression

University of Queensland
Outline

I Linear regression.
I Accuracy of linear regression.
I Problems with the linear regression model.
I Nearest-neighbor regression.
I First comparison of parametric (linear regression) vs. nonparametric (nearest neighbor) learning.
Supervised learning setup
I Given a random variable X , predict another variable Y .
I Example:
I Y = Sales.
I X = Advertising.
I Solution:
I Learn a function fˆ from the data.
I Given input X , output

Y = fˆ(X )

I Simplest candidate function: linear function.

f (X ) = β0 + β1 X

I β0 and β1 are called parameters.


I They are unknown constants to be learned.
I This is called a parametric learning problem: learning f reduces to learning a finite number of parameters.
Simple linear regression

I Assume that the relationship between Y and X is given


approximately by

Y ≈ f (X ) = β0 + β1 X

I More precisely, the deviation of Y from f (X) is assumed to be modeled by a random error (or noise) ε

Y = f (X ) + ε
= β0 + β1 X + ε
How to learn the linear regression model?

I The learning of f in general is hard:

I For each given x, you should be able to output f (x): an infinity of values to learn!
I But if f is a line, you only need two points to pin it down!
I The learning of f (x) = β0 + β1 x only requires learning β0 and β1.
I Objective:
I Given a sample y1 , y2 , ..., yn and x1 , ..., xn ,
I find two values βˆ0 and βˆ1 ,
I that best fit the sample.
How to find the best fit?
I Look at a possible line and find the residuals

ei = yi − βˆ0 − βˆ1 xi

I Magnitude of the residuals:

  $\sum_{i=1}^{n} e_i^2$

I This is called the sum of squared residuals.


Least Squares Minimization

I Objective function

  $Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} e_i^2$

I Find the two numbers β0 and β1 that make Q as small as possible.
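A minimal sketch, assuming Python with NumPy and synthetic data (neither is part of the slides), of minimizing Q via the standard closed-form formulas for β̂0 and β̂1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample with true beta0 = 2 and beta1 = 3.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# Minimizers of Q(beta0, beta1) = sum_i (y_i - beta0 - beta1 x_i)^2:
#   beta1_hat = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

residuals = y - beta0_hat - beta1_hat * x
print(beta0_hat, beta1_hat, np.sum(residuals ** 2))  # estimates and sum of squared residuals
```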
Linear regression model in matrix form 1

I Matrices of output data, input data and residuals.

  $y = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_i \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}$

I Vector of coefficients.

  $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$
Linear regression model in matrix form 2

I Original equations (i is the observation index)

  y1 = β0 + β1 x1 + ε1
  ⋮
  yi = β0 + β1 xi + εi
  ⋮
  yn = β0 + β1 xn + εn

I Equations in matrix form

y = Xβ + ε
Least Squares Minimization 2

I The solution is very simple when written in matrix form.

  β̂ = (β̂0, β̂1)ᵀ = (Xᵀ X)⁻¹ Xᵀ y

I As β̂ is random, its accuracy is given by its variance.

  Var(β̂) = σ² (Xᵀ X)⁻¹

I σ² is the variance of the error term ε.
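A minimal sketch, assuming Python with NumPy and synthetic data, of the matrix-form formulas β̂ = (XᵀX)⁻¹Xᵀy and Var(β̂) = σ²(XᵀX)⁻¹, with σ² estimated from the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# beta_hat = (X'X)^{-1} X'y; np.linalg.solve avoids forming the inverse explicitly.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Estimate sigma^2 from the residuals (n - 2 degrees of freedom for two parameters),
# then Var(beta_hat) = sigma^2 (X'X)^{-1}.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)
var_beta_hat = sigma2_hat * np.linalg.inv(XtX)

print(beta_hat)                        # roughly [2, 3]
print(np.sqrt(np.diag(var_beta_hat)))  # standard errors of beta0_hat and beta1_hat
```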
Multiple Linear regression: 2 inputs
I More than one input

I Y = β0 + β1 X1 + β2 X2 + ε
Multiple Linear regression

I Using matrix notation, going to p inputs is straightforward.


 
  $X = \begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,p} \end{pmatrix}$

I All the formulas in matrix form are the same!

  y = Xβ + ε
  β̂ = (Xᵀ X)⁻¹ Xᵀ y
  Var(β̂) = σ² (Xᵀ X)⁻¹
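The sketch above extends directly to p inputs: only the design matrix gains columns. A minimal example, again assuming Python with NumPy and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3

# p input columns and a response that is linear in them plus noise.
X_inputs = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, -2.0, 3.0])   # [beta0, beta1, ..., betap]
y = beta_true[0] + X_inputs @ beta_true[1:] + rng.normal(scale=0.5, size=n)

# Same formula as before; X is now n x (p + 1).
X = np.column_stack([np.ones(n), X_inputs])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```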
How good is the regression model?

I Does adding an input Xj make sense?
I How many inputs do we need?
I How good is a linear regression model?
I Is there a better model?
Prediction and fit

I From least squares, we get β̂.


I Given x0 , the prediction is

  fˆ(x0) = x0 β̂ = β̂0 + β̂1 x0,1 + · · · + β̂p x0,p

I We can now compute
I Mean squared error. (It is linked to R².)
I Test MSE. (Based on x0 in the test sample)
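A minimal sketch, assuming Python with NumPy and a synthetic train/test split, of computing the training MSE (and the R² it is linked to) and the test MSE:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    # Synthetic data from y = 2 + 3x + noise, returned with an intercept column.
    x = rng.uniform(0, 10, size=n)
    return np.column_stack([np.ones(n), x]), 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)

X_train, y_train = make_data(150)   # training sample
X_test, y_test = make_data(50)      # held-out test sample (the x0's)

beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

train_mse = np.mean((y_train - X_train @ beta_hat) ** 2)   # in-sample MSE
test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)      # test MSE at the test x0's
r_squared = 1 - train_mse / np.var(y_train)                # R^2 is linked to the training MSE
print(train_mse, test_mse, r_squared)
```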
Special input data
I Example: a binary input:

  studenti = 1 if i is a student, and 0 otherwise

I Regression function:

  balancei = β0 + β1 incomei + β2 studenti + εi
           = (β0 + β2) + β1 incomei + εi   if i is a student
           = β0 + β1 incomei + εi          otherwise
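A minimal sketch, assuming Python with NumPy and synthetic balance/income/student data, showing that the student dummy simply shifts the intercept:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

income = rng.uniform(20, 120, size=n)      # synthetic income values
student = rng.integers(0, 2, size=n)       # 1 if i is a student, 0 otherwise
balance = 200 + 6 * income + 380 * student + rng.normal(scale=50, size=n)

# balance_i = beta0 + beta1 income_i + beta2 student_i + eps_i
X = np.column_stack([np.ones(n), income, student])
beta0, beta1, beta2 = np.linalg.solve(X.T @ X, X.T @ balance)

print("intercept for non-students:", beta0)            # beta0
print("intercept for students:    ", beta0 + beta2)    # beta0 + beta2
print("common income slope:       ", beta1)            # beta1
```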
Discrete variable coding
I Example: a discrete input in the regression of Y on X

  X ∈ {red, blue, green}

I Binary variable coding:

I Z1 = 1 if X = red and 0 otherwise.
I Z2 = 1 if X = blue and 0 otherwise.
I Z3 = 1 if X = green and 0 otherwise.
I Dummy variable trap
I Regression Y = β0 + β1 Z1 + β2 Z2 + β3 Z3 + ε
I 4 unknowns with three equations: Identification problem.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Drop one of the three variables Z1 , Z2 and Z3 .
Discrete variable coding (continued)

I Drop Z1 and use red as a base category.


I Contrast function:
          red   blue   green
  red      1     0      0
  blue     0     1      0
  green    0     0      1
I 3 unknowns with three equations in regression
Y = β0 + β2 Z2 + β3 Z3 + ε.
I E [Y |X = red] = β0
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 + β3
I Interpretation of coefficients: deviations from base category.
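A minimal sketch, assuming Python with NumPy and synthetic data with made-up group means, of treatment coding with red as the base category:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# A three-level input X in {red, blue, green}.
X_cat = rng.choice(["red", "blue", "green"], size=n)

# Drop Z1 (red is the base category); keep the Z2 = blue and Z3 = green indicators.
Z2 = (X_cat == "blue").astype(float)
Z3 = (X_cat == "green").astype(float)

# Synthetic group means: 10 (red), 13 (blue), 7 (green), plus noise.
means = {"red": 10.0, "blue": 13.0, "green": 7.0}
y = np.array([means[c] for c in X_cat]) + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), Z2, Z3])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
# beta0 ~ mean for red (the base category);
# beta2, beta3 ~ deviations of blue and green from the base category.
```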
Discrete variable coding (continued)
I Drop the variable representing X = green and use sum contrasts coding.
I Contrast function:
          red   blue
  red      1     0
  blue     0     1
  green   -1    -1
I 3 unknowns with three equations in regression
Y = β0 + β1 Z1 + β2 Z2 + ε.
I E [Y |X = red] = β0 + β1
I E [Y |X = blue] = β0 + β2
I E [Y |X = green] = β0 − β1 − β2
I Interpretation of coefficients: Average effect

  (E[Y|X = red] + E[Y|X = blue] + E[Y|X = green]) / 3 = 3β0 / 3 = β0
Polynomial regression
I Example from the Auto data: a degree-2 polynomial.

  mpgi = β0 + β1 horsepoweri + β2 horsepoweri² + εi
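A minimal sketch, assuming Python with NumPy and synthetic horsepower/mpg data (not the actual Auto data set), of fitting a degree-2 polynomial, which is still linear in the parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Synthetic stand-in for the Auto data: mpg falls with horsepower, with some curvature.
horsepower = rng.uniform(50, 230, size=n)
mpg = 56 - 0.47 * horsepower + 0.0012 * horsepower ** 2 + rng.normal(scale=3.0, size=n)

# The design matrix simply gains a squared column; the least squares formula is unchanged.
X = np.column_stack([np.ones(n), horsepower, horsepower ** 2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ mpg)
print(beta_hat)  # roughly [56, -0.47, 0.0012]
```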


List of potential problems

I The relationship between inputs and outputs is nonlinear.
I Correlation of errors. (Time series data, panel data.)
I The variance of the error term is not constant.
I Outliers and high-leverage points.
I Collinear inputs.
I Endogeneity → causal inference issues.
Causal Inference and other pitfalls

I Simpson’s Paradox

I Interpretability and policy recommendations.


Nearest neighbor regression
I Nearest neighbor regression

I N0 is the set of K nearest neighbors to x0

  $\hat f(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i$
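A minimal sketch, assuming Python with NumPy and synthetic one-dimensional data, of the K-nearest-neighbor prediction above:

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k):
    """Predict y at x0 as the average y of the K nearest training points (the set N0)."""
    dist = np.linalg.norm(x_train - x0, axis=1)   # Euclidean distances to x0
    neighbors = np.argsort(dist)[:k]              # indices of the K closest points
    return y_train[neighbors].mean()

rng = np.random.default_rng(8)
x_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + rng.normal(scale=0.2, size=200)

x0 = np.array([3.0])
print(knn_regress(x0, x_train, y_train, k=5))  # compare with np.sin(3.0) ~ 0.141
```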
Comparison of parametric and nonparametric regression

I Curse of dimensionality
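A minimal sketch, assuming Python with NumPy, illustrating the curse of dimensionality: for a fixed sample size, even the nearest neighbor stops being near as the number of inputs grows, which hurts nonparametric methods like nearest-neighbor regression.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 1000  # fixed sample size

# Distance from the centre of the unit cube to its nearest sampled neighbor,
# for increasing input dimension p.
for p in (1, 2, 5, 10, 50):
    X = rng.uniform(0, 1, size=(n, p))
    nearest = np.min(np.linalg.norm(X - 0.5, axis=1))
    print(p, round(nearest, 3))   # grows quickly with p
```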
