Understanding The Geometry of Predictive Models: Workshop at S P Jain School Institute of Management and Research
Sourish Das
I Typically,
y = f (X),
where y is the target variable and X is the feature matrix.
I Supervised learning problems
y = f (X)
are typically of two types: regression (continuous y) and classification (categorical y).
[Figure: scatter plot of mpg against wt]
Regression Line
mpg = β0 + β1 wt + ε
[Figure: mpg against wt with a fitted regression line]
Regression Plane
mpg = β0 + β1 wt + β2 disp + ε
[Figure: 3-D view of the regression plane for mpg against wt and disp]
Regression Model
Y = β0 + X1 β1 + X2 β2 + · · · + Xp βp + ε.
[Figure: mpg against wt with a candidate regression line]
Regression Line: Another Ad Hoc Choice
mpg = 39 − 6 wt + ε
[Figure: mpg against wt with the line mpg = 39 − 6 wt]
Choice of β
Two ad hoc choices: (β0 = 35, β1 = −5) and (β0 = 39, β1 = −6)
[Figure: the two candidate (β0, β1) points in the parameter space]
Choice of β
However, there are infinitely many possible choices of β; which one is the best?
[Figure: many candidate (β0, β1) points in the parameter space]
Regression Model
RSS(β) = εᵀε
       = (y − Xβ)ᵀ(y − Xβ)
       = ∑ᵢ₌₁ⁿ (yᵢ − xᵢᵀβ)²
Residual Sum of Squares: Surface
∂RSS(β)/∂β = 0
⟹ ∂/∂β [(y − Xβ)ᵀ(y − Xβ)] = 0
⟹ −2Xᵀ(y − Xβ) = 0
⟹ XᵀXβ = Xᵀy    (Normal Equations)
I XᵀX is a p × p matrix,
I so the normal equations are a system of p equations in p unknowns.
System of Equations
Ax = b
I If rank(A|b) > rank(A), then no solution exists.
I For the normal equations, however, rank(XᵀX | Xᵀy) = rank(XᵀX) always holds (Xᵀy lies in the column space of Xᵀ, hence of XᵀX), so a solution always exists.
I If XᵀX is invertible, the solution is unique:
β̂ = (XᵀX)⁻¹ Xᵀy
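The closed form above can be checked numerically. A minimal sketch with synthetic data and made-up coefficients (solving the normal equations directly rather than inverting XᵀX):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: y = 3 - 2x with no noise, so OLS recovers beta exactly
n = 50
x = rng.uniform(0, 5, size=n)
X = np.column_stack([np.ones(n), x])      # design matrix: intercept column + x
y = 3.0 - 2.0 * x

# Normal equations: (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # ≈ [ 3. -2.]
```

In practice `np.linalg.lstsq` (or a QR factorization) is preferred to forming XᵀX explicitly, but the normal-equations route mirrors the derivation above.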
Quiz: Mean Absolute Deviation?
[Figure: the squared loss x² and the absolute loss |x| over x ∈ [−1, 1]]
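One way to answer the quiz: minimizing the squared loss leads to the mean, while minimizing the absolute loss leads to the median, which is far more robust to outliers. A small grid-search illustration on made-up data:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # note the outlier
grid = np.linspace(0.0, 110.0, 11001)           # candidate constants c (step 0.01)

rss = ((data[:, None] - grid) ** 2).sum(axis=0)  # sum of squared deviations
sad = np.abs(data[:, None] - grid).sum(axis=0)   # sum of absolute deviations

c_mean = grid[rss.argmin()]
c_med = grid[sad.argmin()]
print(round(c_mean, 2), round(c_med, 2))   # 22.0 3.0 -> the mean vs the median
```

The outlier drags the squared-loss minimizer up to the mean (22.0), while the absolute-loss minimizer stays at the median (3.0).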
Model Assumptions
1. E(ε) = 0ₙ
2. Cov(ε) = σ² Iₙ
Implication of the Assumptions
I Assumptions:
1. E(ε) = 0ₙ
2. Cov(ε) = σ² Iₙ
I These induce a distribution on y, such that
E(y) = E(Xβ + ε) = Xβ
and
Cov(y) = Cov(Xβ + ε) = σ² Iₙ
I Note that we have not made any distributional assumption on ε yet.
I Unbiasedness: for any vector c, E(cᵀβ̂) = cᵀβ.
I Prediction at a new point x₀: y₀ = f(x₀) + ε₀.
[Figure: scatter plot of mpg against wt]
Linear Regression
I mpg = β0 + β1 wt + ε
I We write the model in linear-model form:
y = Xβ + ε
I Normal Equations: XᵀXβ = Xᵀy
[Figure: scatter plot of mpg against hp]
Quadratic Regression
I mpg = β0 + β1 hp + β2 hp² + ε
I We write the model in linear-model form:
y = Xβ + ε
with design matrix

    ⎡ 1  hp₁  hp₁² ⎤
X = ⎢ 1  hp₂  hp₂² ⎥
    ⎢ ⋮   ⋮    ⋮  ⎥
    ⎣ 1  hpₙ  hpₙ² ⎦

I Normal Equations: XᵀXβ = Xᵀy
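A sketch of the quadratic fit via the same normal equations. The data here are synthetic (not the mtcars values), the predictor is rescaled to hundreds of hp to keep XᵀX well-conditioned, and the response is noiseless so the true coefficients are recovered exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictor, rescaled to hundreds of "hp" for numerical stability
h = rng.uniform(0.5, 3.0, size=40)
y = 40.0 - 2.0 * h + 3.0 * h**2          # noiseless quadratic response

# Design matrix with columns 1, h, h^2: quadratic in h, still linear in beta
X = np.column_stack([np.ones_like(h), h, h**2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # ≈ [40. -2. 3.]
```

The model is nonlinear in hp but linear in β, which is why the normal equations apply unchanged.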
[Figure: mpg against hp with the fitted quadratic curve]
Feature Engineering
mpg = β0 + β1 hp + β2 hp² + ε
[Figure: 3-D view of mpg against hp and the engineered feature hp²]
Sampling distribution of β̂
y = Xβ + ε,  where ε ∼ N(0ₙ, σ² Iₙ)
β̂ ∼ Nₚ(β, σ² (XᵀX)⁻¹)
Sampling distribution for β0 and β1
mpg = β0 + β1 wt + ε
[Figure: joint sampling distribution of (β̂0, β̂1)]
Sampling distribution
I y = Xβ + ε, where ε ∼ N(0, σ² Iₙ)
I Sampling distribution of β̂ is
β̂ ∼ N(β, σ² (XᵀX)⁻¹)
I In addition,
RSS ∼ σ² χ²ₙ₋ₚ
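A Monte Carlo sketch of this result (fixed synthetic design; the values of β and σ are assumptions, loosely echoing the mpg-on-wt fit): across repeated noise draws, the empirical covariance of β̂ should match σ²(XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design; true beta and sigma are assumed for illustration
n, sigma = 100, 1.0
x = rng.uniform(0.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
beta = np.array([35.0, -5.0])

XtX_inv = np.linalg.inv(X.T @ X)
theory_cov = sigma**2 * XtX_inv            # sigma^2 (X^T X)^{-1}

# Redraw the noise 5000 times and refit by OLS each time
draws = np.empty((5000, 2))
for r in range(5000):
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    draws[r] = XtX_inv @ (X.T @ y)         # OLS estimate for this sample

print(np.cov(draws.T))                     # close to theory_cov
```

The empirical mean of the draws also recovers β, illustrating unbiasedness at the same time.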
Statistical Inference for β
I For the i-th predictor,
(β̂ᵢ − βᵢ) / ( σ √((XᵀX)⁻¹ᵢᵢ) ) ∼ N(0, 1)
I Also,
(n − p)s² / σ² ∼ χ²ₙ₋ₚ,  where s² = RSS/(n − p); this implies
E[ RSS/(n − p) ] = σ²
I Replacing σ with s gives
t = (β̂ᵢ − βᵢ) / ( s √((XᵀX)⁻¹ᵢᵢ) ) ∼ tₙ₋ₚ,
where s √((XᵀX)⁻¹ᵢᵢ) is the standard error of β̂ᵢ.
I Null hypothesis H0 : βᵢ = 0 vs alternate hypothesis HA : βᵢ ≠ 0
(predictor Xᵢ has an impact on y)
I Under H0,
t = (β̂ᵢ − 0) / ( s √((XᵀX)⁻¹ᵢᵢ) ) = (β̂ᵢ − 0) / se(β̂ᵢ) ∼ tₙ₋ₚ
I H0 : β1 = 0 vs HA : β1 ≠ 0

                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)    37.285       1.878   19.858         0
    wt             -5.344       0.559   -9.559         0

t-value = (β̂1 − 0) / se(β̂1) = (−5.344 − 0) / 0.559 = −9.559
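The t value column in such a summary is simply Estimate divided by Std. Error; a quick check of the arithmetic above:

```python
# t value for wt, reproduced from the regression summary table above
estimate, std_error = -5.344, 0.559
t_value = (estimate - 0.0) / std_error
print(round(t_value, 2))   # -9.56 (the slide's -9.559 comes from unrounded inputs)
```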
I H0 : β1 = 0 vs HA : β1 ≠ 0

                 Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)    37.227       1.599   23.285     0.000
    wt             -3.878       0.633   -6.129     0.000
    hp             -0.032       0.009   -3.519     0.001

t-value = (β̂1 − 0) / se(β̂1) = (−3.878 − 0) / 0.633 = −6.129
[Figure: mpg against hp, and mpg against hp and hp²]
Let’s understand the Model Complexity
Quadratic Regression: mpg = β0 + β1 hp + β2 hp² + ε
[Figure: mpg against hp under the two candidate fits]
M1 Regression Line
mpg = β0 + β1 hp + ε
M2 Regression Plane
mpg = β0 + β1 hp + β2 hp² + ε
y = Xβ + ε,
β̂ ∼ Nₚ(β, σ² (XᵀX)⁻¹)
What is multicollinearity?
I Multicollinearity: the predictors are themselves strongly correlated, e.g. ρ(wt, drat) = −0.71
Sampling distribution for β1 and β2
OLS Estimator induces ρ(β̂1, β̂2) = −0.92
[Figure: joint sampling distribution of (β̂1, β̂2) under OLS]
Sampling distribution for β1 and β2
Ridge Estimator induces ρ(β̂1, β̂2) = −0.73
[Figure: joint sampling distribution of (β̂1, β̂2) under ridge]
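The ridge estimator behind the plot above replaces the normal equations with β̂_ridge = (XᵀX + λI)⁻¹Xᵀy, trading a little bias for lower variance under collinearity. A minimal sketch on synthetic correlated predictors (the data, λ, and the penalize-the-intercept shortcut are all illustrative assumptions, not the slide's actual values):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two deliberately correlated predictors (multicollinearity by construction)
n = 200
x1 = rng.normal(0.0, 1.0, size=n)
x2 = -0.9 * x1 + rng.normal(0.0, np.sqrt(1.0 - 0.81), size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0.0, 1.0, size=n)

lam = 5.0                                  # illustrative penalty; tune by CV in practice
p = X.shape[1]

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: (X^T X + lambda I)^{-1} X^T y (intercept penalized too, for brevity)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(beta_ols)
print(beta_ridge)   # ridge shrinks the L2 norm of the coefficient vector
```

Shrinking the coefficients is what reduces the strong negative correlation between β̂1 and β̂2 seen in the OLS plot.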
Thank You
[email protected]