
Understanding the Geometry of Predictive Models

Workshop at
S P Jain School Institute of Management and Research

Sourish Das

Chennai Mathematical Institute

10-11th June, 2024


Introduction
Reference
Reading material

- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann (2006).
- Web Data Mining, Bing Liu, Springer Verlag (2007).

For a good introduction to text mining and information retrieval, please see:

- An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press (2009). (Available online at http://www-nlp.stanford.edu/IR-book).
Supervised Learning
Motivating Examples of Supervised Learning

Ex 1 Given the different features of a new prototype car, can you predict the mileage or ‘miles per gallon’ of the car?

                      mpg  cyl  disp   hp
    Mazda RX4        21.0    6   160  110
    Mazda RX4 Wag    21.0    6   160  110
    Datsun 710       22.8    4   108   93
    Hornet 4 Drive   21.4    6   258  110
    .....
    Prototype           ?    4   120  100

- Note that your objective is to predict the variable mpg.
- We are going to use the mtcars data set in R.


Motivating Examples of Supervised Learning

Ex 2 Given the credit history and other features of a loan applicant, a bank manager wants to predict whether the loan application would turn into a good or a bad loan.

- Note that your objective is to predict the label of the loan: good or bad.

How to identify if a problem is a predictive analytics problem?

- Ask a question to your client or collaborator: "Do you want to predict something?"
- If the answer is 'yes', then ask which variable.
- Check whether that variable is available in the database.
- If yes, then you can consider it a predictive analytics problem.
Supervised learning

- Supervised learning algorithms are trained using labeled data.
- For example, a piece of equipment could have data points labeled either "F" (failed) or "R" (runs).
- Typically,

      y = f(X),

  where y is the target variable and X is the feature matrix.
- Objective: learn f(·).

Supervised learning

- Supervised learning problems

      y = f(X)

  are typically of two types (see the sketch below):
  1. Regression: the target variable y is a continuous variable, e.g., income, blood pressure, distance, etc.
  2. Classification: the target variable y is a categorical or label variable, e.g., species type, color, class, etc.
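
To make the distinction concrete, here is a minimal R sketch (illustrative, not part of the slides; it assumes the built-in mtcars and iris data sets and an arbitrary choice of predictors) fitting a regression model for a continuous target and a logistic-regression classifier for a categorical target:

    ## Regression: continuous target (mpg)
    reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

    ## Classification: two-class categorical target derived from iris
    iris2 <- subset(iris, Species != "setosa")
    iris2$Species <- droplevels(iris2$Species)
    clf_fit <- glm(Species ~ Sepal.Length + Sepal.Width,
                   data = iris2, family = binomial)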
Data: Quantitative Response

Training data:

    x11  x12  ...  x1p | y1
    x21  x22  ...  x2p | y2
     .    .    .    .  |  .
    xn1  xn2  ...  xnp | yn

Test data:

    x*11  x*12  ...  x*1p | y*1 = ?
      .     .    .     .  |   .
    x*m1  x*m2  ...  x*mp | y*m = ?

- Dtrain = (X, y) is the training data set, where X is the matrix of predictors or features and y is the dependent or target variable.
- Dtest = (X*, y* = ?) is the test data set, where X* is the matrix of predictors or features; y* is missing and we want to forecast or predict y*.
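
A minimal R sketch (illustrative; the 24/8 split and the predictors wt and hp are arbitrary choices, not from the slides) of forming Dtrain and Dtest from mtcars with mpg as the target:

    set.seed(42)                                          # for reproducibility
    idx    <- sample(seq_len(nrow(mtcars)), size = 24)    # arbitrary 24/8 split
    Dtrain <- mtcars[idx, ]                               # (X, y): mpg observed
    Dtest  <- mtcars[-idx, setdiff(names(mtcars), "mpg")] # X* only; y* unknown

    fit  <- lm(mpg ~ wt + hp, data = Dtrain)              # learn f(.) on Dtrain
    yhat <- predict(fit, newdata = Dtest)                 # forecast y* for Dtest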
Data: Qualitative Response

Training data:

    x11  x12  ...  x1p | G1
    x21  x22  ...  x2p | G2
     .    .    .    .  |  .
    xn1  xn2  ...  xnp | Gn

Test data:

    x*11  x*12  ...  x*1p | G*1 = ?
      .     .    .     .  |   .
    x*m1  x*m2  ...  x*mp | G*m = ?

- Qualitative variables are also referred to as categorical or discrete variables, as well as factors.
Motivating Examples of Regression

Ex Given the different features of a new prototype car, can you predict the mileage or ‘miles per gallon’ of the car?

                      mpg  cyl  disp   hp     wt
    Mazda RX4        21.0    6   160  110  2.620
    Mazda RX4 Wag    21.0    6   160  110  2.875
    Datsun 710       22.8    4   108   93  2.320
    Hornet 4 Drive   21.4    6   258  110  3.215
    .....
    Prototype           ?    4   120  100  3.200

- Note that your objective is to predict the variable mpg.

Plot the data

[Figure: scatter plot of mpg against wt for the mtcars data.]
Regression Line
mpg = β0 + β1 wt + ε

[Figure: scatter plot of mpg against wt with the fitted regression line.]
Regression Plane
mpg = β0 + β1 wt + β2 disp + ε

[Figure: 3-D scatter plot of mpg against wt and disp with the fitted regression plane.]
Regression Model

- Given a vector of inputs X^T = (X1, X2, . . . , Xp), we predict the output Y via the model

      Y = β0 + X1 β1 + X2 β2 + · · · + Xp βp + ε.

  The term β0 is the intercept, also known as the bias in machine learning.
- Often it is convenient to include the constant variable 1 in X and to include β0 in the vector of coefficients β = (β0, β1, · · · , βp).
- We have data on y and X.
- How can we estimate β?


Regression Line - Ad hoc choice of β0 and β1
mpg = 35 − 5 wt + ε

[Figure: scatter plot of mpg against wt with the line mpg = 35 − 5 wt.]
Regression Line - Another ad hoc choice
mpg = 39 − 6 wt + ε

[Figure: scatter plot of mpg against wt with the line mpg = 39 − 6 wt.]
Choice of β
(β0 = 35, β1 = −5) and (β0 = 39, β1 = −6)

[Figure: the two ad hoc choices plotted as points in the (β0, β1) plane.]
Choice of β
However, there are thousands of possible choices - which one is best?

[Figure: many candidate (β0, β1) points in the (β0, β1) plane.]
Regression Model

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},

  where

      y = (y1, y2, . . . , yn)^T,   ε = (ε1, ε2, . . . , εn)^T,

      X = | x11  x12  · · ·  x1p |
          | x21  x22  · · ·  x2p |
          |  .    .    .      .  |
          | xn1  xn2  · · ·  xnp |

- It is convenient to include the constant variable 1 in X, to include the intercept.
- How can we estimate β = (β1, · · · , βp)?


How do we fit Linear Regression Models?

- There are many different methods; the most popular is least squares.
- Minimize the residual sum of squares:

      RSS(β) = ε^T ε
             = (y − Xβ)^T (y − Xβ)
             = Σ_{i=1}^{n} (y_i − x_i^T β)²
Residual Sum of Squares: Surface

- RSS(β) is a quadratic function of the parameters.
- Its minimum always exists, but may not be unique.


How do we fit Regression models?

- Differentiate RSS(β) with respect to β and equate to 0:

      ∂RSS(β)/∂β = 0
      ⟹ ∂/∂β [(y − Xβ)^T (y − Xβ)] = 0
      ⟹ −2 X^T (y − Xβ) = 0
      ⟹ X^T X β = X^T y        (Normal Equations)

- X^T X is a p × p matrix.
- So the normal equations have p unknowns and p equations.
System of Equations

- Suppose that for a known matrix A_{p×p} and vector b_{p×1}, we wish to find a vector x_{p×1} such that

      Ax = b

- The standard approach is ordinary least squares:

      minimize_x ||Ax − b||²,

  where ||·|| is the Euclidean norm.
- The solution for x is

      x̂ = A^{-1} b

- What happens if A is not invertible?
Solution to a System of Equations

- If rank(A|b) > rank(A), then no solution exists.
- If rank(A|b) = rank(A), then at least one solution exists.
- If rank(A|b) = rank(A) = p, that is, A is a full-rank matrix, then A^{-1} exists uniquely and the solution x̂ = A^{-1} b is unique.
- If rank(A|b) = rank(A) < p, that is, A is a less-than-full-rank matrix, then x has infinitely many solutions. This is considered an ill-posed problem. Which solution should we choose, and how?
How do we fit Regression models?
Theorem
For the normal equations,

      rank(X^T X | X^T y) = rank(X^T X)

- Whatever your data may be, the normal equations guarantee at least one solution.
- At least one solution always exists if you adopt the least squares method.
- If X^T X is nonsingular, i.e., rank(X^T X) = p, then the unique solution is given by

      β̂ = (X^T X)^{-1} X^T y
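
As a sanity check, a minimal R sketch (illustrative; the choice of predictors wt and hp is an assumption) that solves the normal equations directly on mtcars and compares the result with lm():

    X <- cbind(1, mtcars$wt, mtcars$hp)         # include the constant column 1
    y <- mtcars$mpg

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X β = X^T y
    beta_hat

    coef(lm(mpg ~ wt + hp, data = mtcars))      # same estimates from lm()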
Quiz: Mean Absolute Deviation?

- What about the mean absolute deviation?

      Δ(β) = Σ_{i=1}^{n} |y_i − x_i^T β|

- Conceptually there is no problem - certainly you can do that.
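
For completeness, a minimal R sketch of a least-absolute-deviation fit; it assumes the external quantreg package (not used in the slides), whose rq() function with tau = 0.5 minimizes the sum of absolute residuals, i.e., the criterion Δ(β) above:

    library(quantreg)

    lad_fit <- rq(mpg ~ wt, tau = 0.5, data = mtcars)   # least absolute deviations
    ols_fit <- lm(mpg ~ wt, data = mtcars)              # least squares, for comparison

    coef(lad_fit)
    coef(ols_fit)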


How do we fit Regression models?

[Figure: the squared-error loss x² (left) and the absolute-error loss |x| (right) plotted against x on [−1, 1].]
Model Assumptions

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.

- X_{n×p}, known as the design matrix, is typically considered deterministic, and n > p.
- The ε_i (also known as errors / residuals), i = 1, 2, · · · , n, are random variables with
  1. E(ε_i) = 0, ∀ i
  2. Var(ε_i) = E(ε_i²) = σ², ∀ i   (Homoscedasticity)
  3. Cov(ε_i, ε_j) = E(ε_i ε_j) = 0, ∀ i ≠ j   (Independence)


Model Assumptions in Matrix Notation

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.

- X_{n×p}, known as the design matrix, is typically considered deterministic.
- The ε_i (also known as errors / residuals), i = 1, 2, · · · , n, are random variables with
  1. E(ε) = 0_n
  2. Cov(ε) = σ² I_n
Implication of the Assumptions

- Assumptions:
  1. E(ε) = 0_n
  2. Cov(ε) = σ² I_n
- These induce a distribution on y, such that

      E(y) = E(Xβ + ε) = Xβ + E(ε) = Xβ

  and

      Cov(y) = Cov(Xβ + ε) = σ² I_n

- Note that we have not made any distributional assumption on ε yet.
- We will introduce that assumption a little later.


Implication of the Assumptions

- What is the expected value of cy, if c is a constant?

  Result 1 Since E(y) = Xβ, we have E(cy) = cXβ.

- Now consider the ordinary least squares (OLS) estimator of β:

      β̂ = (X^T X)^{-1} X^T y

      E(β̂) = E((X^T X)^{-1} X^T y)
            = (X^T X)^{-1} X^T E(y) = (X^T X)^{-1} X^T X β
            = β

  Result 2 The OLS estimator β̂ is an unbiased estimator of β.


Implication of the Assumptions

- Suppose we are interested in some linear combination of the regression coefficients, say f(β) = c^T β.

  Result 3 Then an unbiased estimator of c^T β is c^T β̂, i.e.,

      E(c^T β̂) = c^T β.

- Suppose c = x_0 is a test point. Then the prediction f(x_0) = x_0^T β we are interested in is of this form.
Gauss-Markov Theorem

- If any other linear estimator θ̃ = a^T y is unbiased for c^T β, that is,

      E(a^T y) = c^T β,

  then

      Var(c^T β̂) ≤ Var(a^T y)

- The proof is a homework problem.

Note: OLS estimates of the parameters β have the smallest variance among all linear unbiased estimates.
Notes on the Gauss-Markov Theorem

- Consider the mean squared error (MSE) of an estimator θ̃ in estimating θ:

      MSE(θ̃) = E(θ̃ − θ)²
              = Var(θ̃) + [E(θ̃) − θ]²
              = Var(θ̃) + [bias]²

- The Gauss-Markov theorem implies that the least squares estimator has the smallest MSE of all linear estimators with no bias.
- However, there may well exist a biased estimator with smaller MSE. For example, (i) the ridge estimator and (ii) the James-Stein shrinkage estimator of β trade a little bias for a reduction in variance, and their MSE is lower than that of the OLS estimator.
Why Mean Squared Error?

- MSE is directly related to prediction accuracy.
- Consider the prediction of a new response at input x_0:

      y_0 = f(x_0) + ε_0.

- The expected prediction error of an estimate f̂(x_0) = x_0^T β̂ is

      E(y_0 − f̂(x_0))² = σ² + E(x_0^T β̂ − f(x_0))²
                        = σ² + MSE(x_0^T β̂)

- The expected prediction error and the MSE differ only by the constant σ².
Linear Regression
mpg = β0 + β1 wt + ε

[Figure: scatter plot of mpg against wt.]
Linear Regression

- mpg = β0 + β1 wt + ε
- We write the model as a linear model

      y = Xβ + ε,

  where y = (mpg_1, mpg_2, . . . , mpg_n)^T;

      X = | 1  wt_1 |
          | 1  wt_2 |
          | .    .  |
          | 1  wt_n |

  β = (β0, β1)^T and ε = (ε_1, ε_2, . . . , ε_n)^T


Linear Regression

- Normal Equations:

      β̂ = (β̂0, β̂1)^T = (X^T X)^{-1} X^T y

        = | n        Σ wt_i  |^{-1}  | Σ mpg_i        |
          | Σ wt_i   Σ wt_i² |       | Σ wt_i · mpg_i |

  (all sums run over i = 1, . . . , n)
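
A minimal R sketch (illustrative) of these 2×2 normal equations for mpg ~ wt on mtcars, compared with lm():

    wt  <- mtcars$wt
    mpg <- mtcars$mpg
    n   <- length(mpg)

    XtX <- matrix(c(n,       sum(wt),
                    sum(wt), sum(wt^2)), nrow = 2, byrow = TRUE)
    Xty <- c(sum(mpg), sum(wt * mpg))

    solve(XtX, Xty)                      # (β̂0, β̂1) from the normal equations
    coef(lm(mpg ~ wt, data = mtcars))    # the same estimates via lm()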
Quadratic Regression
mpg = β0 + β1 hp + β2 hp² + ε

[Figure: scatter plot of mpg against hp.]
Quadratic Regression

- mpg = β0 + β1 hp + β2 hp² + ε
- We write the model as a linear model

      y = Xβ + ε,

  where y = (mpg_1, mpg_2, . . . , mpg_n)^T;

      X = | 1  hp_1  hp_1² |
          | 1  hp_2  hp_2² |
          | .    .     .   |
          | 1  hp_n  hp_n² |

  β = (β0, β1, β2)^T and ε = (ε_1, ε_2, . . . , ε_n)^T

- The model is still linear in the parameters.
Linear Regression

- Normal Equations:

      β̂ = (β̂0, β̂1, β̂2)^T = (X^T X)^{-1} X^T y

        = | n        Σ hp_i   Σ hp_i² |^{-1}  | Σ mpg_i         |
          | Σ hp_i   Σ hp_i²  Σ hp_i³ |       | Σ hp_i · mpg_i  |
          | Σ hp_i²  Σ hp_i³  Σ hp_i⁴ |       | Σ hp_i² · mpg_i |

  (all sums run over i = 1, . . . , n)
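
A minimal R sketch (illustrative) of fitting the quadratic model; I() protects hp^2 inside the formula so it is treated as an extra column of the design matrix:

    quad_fit <- lm(mpg ~ hp + I(hp^2), data = mtcars)
    coef(quad_fit)

    # Equivalent "by hand": design matrix with columns 1, hp, hp^2
    X <- cbind(1, mtcars$hp, mtcars$hp^2)
    solve(t(X) %*% X, t(X) %*% mtcars$mpg)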
Regression Line
mpg = β0 + β1 hp + ε

[Figure: scatter plot of mpg against hp with the fitted regression line.]
Feature Engineering
mpg = β0 + β1 hp + β2 hp² + ε

[Figure: 3-D plot of mpg against hp and the engineered feature hp².]
Sampling distribution of β̂

- Consider the standard linear model

      y = Xβ + ε,

  where ε ∼ N(0, σ² I_n) and n > p.
- This implies y ∼ N(Xβ, σ² I_n).
- The least squares estimator of β is β̂ = (X^T X)^{-1} X^T y.
- The sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
Sampling distribution of β̂

Result If y_{p×1} ∼ N_p(µ, Σ) and c is a q × p matrix, then

      z = cy ∼ N_q(cµ, cΣc^T)

- You can use this result to argue that the sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
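
A minimal R sketch (illustrative; the model mpg ~ wt + hp is an assumption) showing that the estimated covariance of β̂, s² (X^T X)^{-1}, is exactly what vcov() returns for an lm fit:

    fit <- lm(mpg ~ wt + hp, data = mtcars)

    X  <- model.matrix(fit)                          # design matrix with intercept
    s2 <- sum(residuals(fit)^2) / df.residual(fit)   # s² = RSS / (n - p)

    s2 * solve(t(X) %*% X)   # hand-computed covariance of β̂
    vcov(fit)                # matches the hand computation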
Sampling distribution for β0 and β1
mpg = β0 + β1 wt + ε

[Figure: joint sampling distribution of (β̂0, β̂1) plotted in the (β0, β1) plane, repeated across several slides.]
Sampling distribution

- y = Xβ + ε, where ε ∼ N(0, σ² I_n)
- The OLS estimator is β̂ = (X^T X)^{-1} X^T y
- The sampling distribution of β̂ is

      β̂ ∼ N(β, σ² (X^T X)^{-1})

- The residual sum of squares is

      RSS = (y − Xβ̂)^T (y − Xβ̂)

  In addition,

      RSS ∼ σ² χ²_{n−p}
Statistical Inference for β

- For the i-th predictor,

      (β̂_i − β_i) / ( σ √[(X^T X)^{-1}]_ii ) ∼ N(0, 1)

- From the χ² distribution of RSS we have

      (n − p) s² / σ² ∼ χ²_{n−p},

  where s² = RSS / (n − p). This implies

      E( RSS / (n − p) ) = σ²,

  i.e., s² is an unbiased estimator of σ².
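
A minimal R sketch (illustrative; the model mpg ~ wt + hp is an assumption) computing s² = RSS / (n − p) and checking it against the value R reports:

    fit <- lm(mpg ~ wt + hp, data = mtcars)

    RSS <- sum(residuals(fit)^2)
    s2  <- RSS / df.residual(fit)   # divide by n - p

    s2
    sigma(fit)^2                    # the same quantity extracted from the fit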


Statistical Inference for β

- Note that in the sampling distribution of β̂, σ² is unknown.
- When we estimate σ² by its unbiased estimator s² = RSS / (n − p),

      t = (β̂_i − β_i) / ( s √[(X^T X)^{-1}]_ii ) ∼ t_{n−p},

  where s √[(X^T X)^{-1}]_ii is the standard error of β̂_i.
- To test the null hypothesis H0 : β_i = 0 (predictor X_i has no impact on the dependent variable y), we can use the statistic t.
Statistical Inference for β

- Null hypothesis H0 : β_i = 0 (predictor X_i has no impact on the dependent variable y).
- Alternative hypothesis HA : β_i ≠ 0 (predictor X_i has an impact on y).
- Under H0, the test statistic is

      t = (β̂_i − 0) / ( s √[(X^T X)^{-1}]_ii ) ∼ t_{n−p}

- At the 100 × α% level of significance, if t_observed > t_{n−p}(α) or t_observed < −t_{n−p}(α), then we reject the null hypothesis.
Statistical Inference for β

- H0 : β_i = 0 vs HA : β_i ≠ 0
- Under H0, the test statistic is

      t = (β̂_i − 0) / ( s √[(X^T X)^{-1}]_ii ) = (β̂_i − 0) / se(β̂_i) ∼ t_{n−p}

- The p-value is the probability of obtaining test results at least as extreme as the observed result, assuming that the null hypothesis is correct.
- p-value = 2 · P(t > |t_observed| | H0 is true)
- If the p-value is too small, we reject the null hypothesis.
- Otherwise, we say we fail to reject the null hypothesis.
Does wt have a statistically significant effect on mpg?

- mpg = β0 + β1 wt + ε
- H0 : β1 = 0 vs HA : β1 ≠ 0

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.285       1.878   19.858         0
     wt             -5.344       0.559   -9.559         0

- β̂1 = −5.344 and se(β̂1) = 0.559, so

      (β̂1 − 0) / se(β̂1) = (−5.344 − 0) / 0.559 = −9.559

  and the p-value < 0.01.
- Weight has a statistically significant effect on mpg.
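
A minimal R sketch (illustrative) showing how a coefficient table of this form can be reproduced, and how the t value and two-sided p-value are computed by hand:

    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)$coefficients        # Estimate, Std. Error, t value, Pr(>|t|)

    b1   <- coef(fit)["wt"]
    se1  <- summary(fit)$coefficients["wt", "Std. Error"]
    tval <- (b1 - 0) / se1                                        # about -9.559
    2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE)  # two-sided p-value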


Does wt and/or hp have a statistically significant effect on mpg?

- mpg = β0 + β1 wt + β2 hp + ε
- H0 : β1 = 0 vs HA : β1 ≠ 0

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.227       1.599   23.285     0.000
     wt             -3.878       0.633   -6.129     0.000
     hp             -0.032       0.009   -3.519     0.001

- β̂1 = −3.878 and se(β̂1) = 0.633, and under H0,

      t-value = (β̂1 − 0) / se(β̂1) = (−3.878 − 0) / 0.633 = −6.129

  and the p-value < 0.01.
- Weight has a statistically significant effect on mpg.


Compare the two models

Model 1: mpg = β0 + β1 wt + ε

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.285       1.878   19.858         0
     wt             -5.344       0.559   -9.559         0

Model 2: mpg = β0 + β1 wt + β2 hp + ε

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.227       1.599   23.285     0.000
     wt             -3.878       0.633   -6.129     0.000
     hp             -0.032       0.009   -3.519     0.001

1. Model 1 is a 2D model and Model 2 is a 3D model: are they comparable?
2. The se(β̂1) in Model 2 is higher than in Model 1. Why?

- Let's discuss this issue.
Let's understand the Model Complexity
Linear Regression: mpg = β0 + β1 hp + ε

[Figure: scatter plot of mpg against hp with the fitted line.]

R-squared = 0.602, RMSE = 3.74


Let's understand the Model Complexity
Quadratic Regression: mpg = β0 + β1 hp + β2 hp² + ε

[Figure: 3-D plot of mpg against hp and hp².]
Let's understand the Model Complexity
Quadratic Regression: mpg = β0 + β1 hp + β2 hp² + ε

[Figure: scatter plot of mpg against hp with the fitted quadratic curve.]

R-squared = 0.756, RMSE = 2.93


Let's understand the Model Complexity
Polynomial Regression of order 3: mpg = β0 + β1 hp + β2 hp² + β3 hp³ + ε

[Figure: scatter plot of mpg against hp with the fitted cubic curve.]

R-squared = 0.761, RMSE = 2.903


Let's understand the Model Complexity

               R-squared   RMSE
     Model 1       0.602   3.74
     Model 2       0.756   2.93
     Model 3       0.761   2.903
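
A minimal R sketch (illustrative; RMSE here is taken as the square root of the mean squared residual, an assumption consistent with the numbers above) fitting the three polynomial models and reporting in-sample R-squared and RMSE:

    fits <- list(
      M1 = lm(mpg ~ hp, data = mtcars),
      M2 = lm(mpg ~ hp + I(hp^2), data = mtcars),
      M3 = lm(mpg ~ hp + I(hp^2) + I(hp^3), data = mtcars)
    )

    sapply(fits, function(f) c(
      R.squared = summary(f)$r.squared,
      RMSE      = sqrt(mean(residuals(f)^2))
    ))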
Model Complexity

M1 Regression line:
      mpg = β0 + β1 hp + ε

M2 Regression plane:
      mpg = β0 + β1 hp + β2 hp² + ε

M3 Regression 3-dimensional hyperplane:
      mpg = β0 + β1 hp + β2 hp² + β3 hp³ + ε

M3' Regression 3-dimensional hyperplane:
      mpg = β0 + β1 hp + β2 wt + β3 disp + ε
Quiz: Model Complexity

- Should we blindly increase the model complexity?
- We have to be careful about the bias-variance trade-off.
What is multicollinearity?

- Consider the standard linear model

      y = Xβ + ε,

  where ε ∼ N(0, σ² I_n) and n > p.
- This implies y ∼ N(Xβ, σ² I_n).
- The least squares estimator of β is β̂ = (X^T X)^{-1} X^T y.
- The sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
What is multicollinearity?

- If the correlation between two predictors in X is 1, then one column depends exactly on the other, which results in det(X^T X) = 0.
- Hence X^T X will not be invertible, because

      (X^T X)^{-1} = Adj(X^T X) / det(X^T X)

- In such a case a unique solution does not exist.
Why is multicollinearity a problem?

- Suppose the correlation between two predictors in X is nearly 1 or -1, but not exactly 1.
- For example, cor(X_i, X_j) = 0.99 - what happens then?
- det(X^T X) = δ > 0, where δ is a very small value.
- X^T X is invertible, but every element of (X^T X)^{-1} will be very large.
- A unique solution β̂ exists, but Cov(β̂) = σ² (X^T X)^{-1} will be extremely large, so the standard errors will be very large. Hence valid statistical inference cannot be carried out.
Correlated Predictors

- We consider a simple no-intercept model (see the sketch below):

      mpg = β1 wt + β2 drat + ε

- ρ(wt, drat) = −0.71
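
A minimal R sketch (illustrative) showing how the correlated predictors induce strongly correlated coefficient estimates in this no-intercept model:

    cor(mtcars$wt, mtcars$drat)        # about -0.71

    fit <- lm(mpg ~ wt + drat - 1, data = mtcars)   # "-1" drops the intercept
    V   <- vcov(fit)                                # estimated Cov(β̂) = s²(X^T X)^{-1}
    cov2cor(V)                                      # correlation between β̂1 and β̂2
                                                    # (the slides report about -0.92)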
Sampling distribution for β1 and β2
OLS Estimator induces ρ(β̂1, β̂2) = −0.92

[Figure: joint sampling distribution of (β̂1, β̂2) under the OLS estimator, plotted in the (β1, β2) plane.]
Sampling distribution for β1 and β2
Ridge Estimator induces ρ(β̂1, β̂2) = −0.73

[Figure: joint sampling distribution of (β̂1, β̂2) under the ridge estimator, plotted in the (β1, β2) plane.]
Thank You
[email protected]
