
Understanding the Geometry of Predictive Models

Workshop at
S P Jain School Institute of Management and Research

Sourish Das

Chennai Mathematical Institute

10-11th June, 2024


Introduction
Reference
Reading material

- Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann (2006).
- Web Data Mining, Bing Liu, Springer Verlag (2007).

For a good introduction to text mining and information retrieval, please see:

- An Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press (2009). (Available online at http://www-nlp.stanford.edu/IR-book).
Supervised Learning
Motivating Examples of Supervised Learning

Ex 1 Given the different features of a new prototype car, can you predict the mileage or ‘miles per gallon’ of the car?

                      mpg  cyl  disp   hp
    Mazda RX4        21.0    6   160  110
    Mazda RX4 Wag    21.0    6   160  110
    Datsun 710       22.8    4   108   93
    Hornet 4 Drive   21.4    6   258  110
    .....
    Prototype           ?    4   120  100

- Note that your objective is to predict the variable mpg.
- We are going to use the mtcars data set in R.


Motivating Examples of Supervised Learning

Ex 2 Given the credit history and other features of a loan applicant, a bank manager wants to predict whether the loan application would turn into a good or a bad loan.

- Note that your objective is to predict the label of the loan: good or bad.

How to identify if a problem is a predictive analytics problem?

- Ask a question to your client or collaborator: "Do you want to predict something?"
- If the answer is 'yes', then ask which variable.
- Check whether that variable is available in the database.
- If yes, then you can consider it a predictive analytics problem.
Supervised learning

- Supervised learning algorithms are trained using labeled data.
- For example, a piece of equipment could have data points labeled either "F" (failed) or "R" (runs).
- Typically,

      y = f(X),

  where y is the target variable and X is the feature matrix.
- Objective: learn f(·).

Supervised learning

- Supervised learning problems

      y = f(X)

  are typically of two types (see the sketch below):
  1. Regression: the target variable y is a continuous variable, e.g., income, blood pressure, distance, etc.
  2. Classification: the target variable y is a categorical or label variable, e.g., species type, color, class, etc.
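
To make the distinction concrete, here is a minimal R sketch (illustrative, not part of the slides; it assumes the built-in mtcars and iris data sets and an arbitrary choice of predictors) fitting a regression model for a continuous target and a logistic-regression classifier for a categorical target:

    ## Regression: continuous target (mpg)
    reg_fit <- lm(mpg ~ wt + hp, data = mtcars)

    ## Classification: two-class categorical target derived from iris
    iris2 <- subset(iris, Species != "setosa")
    iris2$Species <- droplevels(iris2$Species)
    clf_fit <- glm(Species ~ Sepal.Length + Sepal.Width,
                   data = iris2, family = binomial)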
Data: Quantitative Response

Training data:

    x11  x12  ...  x1p | y1
    x21  x22  ...  x2p | y2
     .    .    .    .  |  .
    xn1  xn2  ...  xnp | yn

Test data:

    x*11  x*12  ...  x*1p | y*1 = ?
      .     .    .     .  |   .
    x*m1  x*m2  ...  x*mp | y*m = ?

- Dtrain = (X, y) is the training data set, where X is the matrix of predictors or features and y is the dependent or target variable.
- Dtest = (X*, y* = ?) is the test data set, where X* is the matrix of predictors or features; y* is missing and we want to forecast or predict y*.
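
A minimal R sketch (illustrative; the 24/8 split and the predictors wt and hp are arbitrary choices, not from the slides) of forming Dtrain and Dtest from mtcars with mpg as the target:

    set.seed(42)                                          # for reproducibility
    idx    <- sample(seq_len(nrow(mtcars)), size = 24)    # arbitrary 24/8 split
    Dtrain <- mtcars[idx, ]                               # (X, y): mpg observed
    Dtest  <- mtcars[-idx, setdiff(names(mtcars), "mpg")] # X* only; y* unknown

    fit  <- lm(mpg ~ wt + hp, data = Dtrain)              # learn f(.) on Dtrain
    yhat <- predict(fit, newdata = Dtest)                 # forecast y* for Dtest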
Data: Qualitative Response

Training data:

    x11  x12  ...  x1p | G1
    x21  x22  ...  x2p | G2
     .    .    .    .  |  .
    xn1  xn2  ...  xnp | Gn

Test data:

    x*11  x*12  ...  x*1p | G*1 = ?
      .     .    .     .  |   .
    x*m1  x*m2  ...  x*mp | G*m = ?

- Qualitative variables are also referred to as categorical or discrete variables, as well as factors.
Motivating Examples of Regression

Ex Given the different features of a new prototype car, can you predict the mileage or ‘miles per gallon’ of the car?

                      mpg  cyl  disp   hp     wt
    Mazda RX4        21.0    6   160  110  2.620
    Mazda RX4 Wag    21.0    6   160  110  2.875
    Datsun 710       22.8    4   108   93  2.320
    Hornet 4 Drive   21.4    6   258  110  3.215
    .....
    Prototype           ?    4   120  100  3.200

- Note that your objective is to predict the variable mpg.

Plot the data

[Figure: scatter plot of mpg against wt for the mtcars data.]
Regression Line
mpg = β0 + β1 wt + ε

[Figure: scatter plot of mpg against wt with the fitted regression line.]
Regression Plane
mpg = β0 + β1 wt + β2 disp + ε

[Figure: 3-D scatter plot of mpg against wt and disp with the fitted regression plane.]
Regression Model

- Given a vector of inputs X^T = (X1, X2, . . . , Xp), we predict the output Y via the model

      Y = β0 + X1 β1 + X2 β2 + · · · + Xp βp + ε.

  The term β0 is the intercept, also known as the bias in machine learning.
- Often it is convenient to include the constant variable 1 in X and to include β0 in the vector of coefficients β = (β0, β1, · · · , βp).
- We have data on y and X.
- How can we estimate β?


Regression Line - Ad hoc choice of β0 and β1
mpg = 35 − 5 wt + ε

[Figure: scatter plot of mpg against wt with the line mpg = 35 − 5 wt.]
Regression Line - Another ad hoc choice
mpg = 39 − 6 wt + ε

[Figure: scatter plot of mpg against wt with the line mpg = 39 − 6 wt.]
Choice of β
(β0 = 35, β1 = −5) and (β0 = 39, β1 = −6)

[Figure: the two ad hoc choices plotted as points in the (β0, β1) plane.]
Choice of β
However, there are thousands of possible choices - which one is best?

[Figure: many candidate (β0, β1) points in the (β0, β1) plane.]
Regression Model

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},

  where

      y = (y1, y2, . . . , yn)^T,   ε = (ε1, ε2, . . . , εn)^T,

      X = | x11  x12  · · ·  x1p |
          | x21  x22  · · ·  x2p |
          |  .    .    .      .  |
          | xn1  xn2  · · ·  xnp |

- It is convenient to include the constant variable 1 in X, to include the intercept.
- How can we estimate β = (β1, · · · , βp)?


How do we fit Linear Regression Models?

- There are many different methods; the most popular is least squares.
- Minimize the residual sum of squares:

      RSS(β) = ε^T ε
             = (y − Xβ)^T (y − Xβ)
             = Σ_{i=1}^{n} (y_i − x_i^T β)²
Residual Sum of Squares: Surface

- RSS(β) is a quadratic function of the parameters.
- Its minimum always exists, but may not be unique.


How do we fit Regression models?

- Differentiate RSS(β) with respect to β and equate to 0:

      ∂RSS(β)/∂β = 0
      ⟹ ∂/∂β [(y − Xβ)^T (y − Xβ)] = 0
      ⟹ −2 X^T (y − Xβ) = 0
      ⟹ X^T X β = X^T y        (Normal Equations)

- X^T X is a p × p matrix.
- So the normal equations have p unknowns and p equations.
System of Equations

- Suppose that for a known matrix A_{p×p} and vector b_{p×1}, we wish to find a vector x_{p×1} such that

      Ax = b

- The standard approach is ordinary least squares:

      minimize_x ||Ax − b||²,

  where ||·|| is the Euclidean norm.
- The solution for x is

      x̂ = A^{-1} b

- What happens if A is not invertible?
Solution to a System of Equations

- If rank(A|b) > rank(A), then no solution exists.
- If rank(A|b) = rank(A), then at least one solution exists.
- If rank(A|b) = rank(A) = p, that is, A is a full-rank matrix, then A^{-1} exists uniquely and the solution x̂ = A^{-1} b is unique.
- If rank(A|b) = rank(A) < p, that is, A is a less-than-full-rank matrix, then x has infinitely many solutions. This is considered an ill-posed problem. Which solution should we choose, and how?
How do we fit Regression models?
Theorem
For the normal equations,

      rank(X^T X | X^T y) = rank(X^T X)

- Whatever your data may be, the normal equations guarantee at least one solution.
- At least one solution always exists if you adopt the least squares method.
- If X^T X is nonsingular, i.e., rank(X^T X) = p, then the unique solution is given by

      β̂ = (X^T X)^{-1} X^T y
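
As a sanity check, a minimal R sketch (illustrative; the choice of predictors wt and hp is an assumption) that solves the normal equations directly on mtcars and compares the result with lm():

    X <- cbind(1, mtcars$wt, mtcars$hp)         # include the constant column 1
    y <- mtcars$mpg

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X β = X^T y
    beta_hat

    coef(lm(mpg ~ wt + hp, data = mtcars))      # same estimates from lm()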
Quiz: Mean Absolute Deviation?

- What about the mean absolute deviation?

      Δ(β) = Σ_{i=1}^{n} |y_i − x_i^T β|

- Conceptually there is no problem - certainly you can do that.
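
For completeness, a minimal R sketch of a least-absolute-deviation fit; it assumes the external quantreg package (not used in the slides), whose rq() function with tau = 0.5 minimizes the sum of absolute residuals, i.e., the criterion Δ(β) above:

    library(quantreg)

    lad_fit <- rq(mpg ~ wt, tau = 0.5, data = mtcars)   # least absolute deviations
    ols_fit <- lm(mpg ~ wt, data = mtcars)              # least squares, for comparison

    coef(lad_fit)
    coef(ols_fit)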


How do we fit Regression models?

[Figure: the squared-error loss x² (left) and the absolute-error loss |x| (right) plotted against x on [−1, 1].]
Model Assumptions

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.

- X_{n×p}, known as the design matrix, is typically considered deterministic, and n > p.
- The ε_i (also known as errors / residuals), i = 1, 2, · · · , n, are random variables with
  1. E(ε_i) = 0, ∀ i
  2. Var(ε_i) = E(ε_i²) = σ², ∀ i   (Homoscedasticity)
  3. Cov(ε_i, ε_j) = E(ε_i ε_j) = 0, ∀ i ≠ j   (Independence)


Model Assumptions in Matrix Notation

- Given a matrix of inputs X_{n×p} = ((X_ij)), we predict the output y via the model

      y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.

- X_{n×p}, known as the design matrix, is typically considered deterministic.
- The ε_i (also known as errors / residuals), i = 1, 2, · · · , n, are random variables with
  1. E(ε) = 0_n
  2. Cov(ε) = σ² I_n
Implication of the Assumptions

- Assumptions:
  1. E(ε) = 0_n
  2. Cov(ε) = σ² I_n
- These induce a distribution on y, such that

      E(y) = E(Xβ + ε) = Xβ + E(ε) = Xβ

  and

      Cov(y) = Cov(Xβ + ε) = σ² I_n

- Note that we have not made any distributional assumption on ε yet.
- We will introduce that assumption a little later.


Implication of the Assumptions

- What is the expected value of cy, if c is a constant?

  Result 1 Since E(y) = Xβ, we have E(cy) = cXβ.

- Now consider the ordinary least squares (OLS) estimator of β:

      β̂ = (X^T X)^{-1} X^T y

      E(β̂) = E((X^T X)^{-1} X^T y)
            = (X^T X)^{-1} X^T E(y) = (X^T X)^{-1} X^T X β
            = β

  Result 2 The OLS estimator β̂ is an unbiased estimator of β.


Implication of the Assumptions

- Suppose we are interested in some linear combination of the regression coefficients, say f(β) = c^T β.

  Result 3 Then an unbiased estimator of c^T β is c^T β̂, i.e.,

      E(c^T β̂) = c^T β.

- Suppose c = x_0 is a test point. Then the prediction f(x_0) = x_0^T β we are interested in is of this form.
Gauss-Markov Theorem

- If any other linear estimator θ̃ = a^T y is unbiased for c^T β, that is,

      E(a^T y) = c^T β,

  then

      Var(c^T β̂) ≤ Var(a^T y)

- The proof is a homework problem.

Note: OLS estimates of the parameters β have the smallest variance among all linear unbiased estimates.
Notes on the Gauss-Markov Theorem

- Consider the mean squared error (MSE) of an estimator θ̃ in estimating θ:

      MSE(θ̃) = E(θ̃ − θ)²
              = Var(θ̃) + [E(θ̃) − θ]²
              = Var(θ̃) + [bias]²

- The Gauss-Markov theorem implies that the least squares estimator has the smallest MSE of all linear estimators with no bias.
- However, there may well exist a biased estimator with smaller MSE. For example, (i) the ridge estimator and (ii) the James-Stein shrinkage estimator of β trade a little bias for a reduction in variance, and their MSE is lower than that of the OLS estimator.
Why Mean Squared Error?

- MSE is directly related to prediction accuracy.
- Consider the prediction of a new response at input x_0:

      y_0 = f(x_0) + ε_0.

- The expected prediction error of an estimate f̂(x_0) = x_0^T β̂ is

      E(y_0 − f̂(x_0))² = σ² + E(x_0^T β̂ − f(x_0))²
                        = σ² + MSE(x_0^T β̂)

- The expected prediction error and the MSE differ only by the constant σ².
Linear Regression
mpg = β0 + β1 wt + ε

[Figure: scatter plot of mpg against wt.]
Linear Regression

- mpg = β0 + β1 wt + ε
- We write the model as a linear model

      y = Xβ + ε,

  where y = (mpg_1, mpg_2, . . . , mpg_n)^T;

      X = | 1  wt_1 |
          | 1  wt_2 |
          | .    .  |
          | 1  wt_n |

  β = (β0, β1)^T and ε = (ε_1, ε_2, . . . , ε_n)^T


Linear Regression

- Normal Equations:

      β̂ = (β̂0, β̂1)^T = (X^T X)^{-1} X^T y

        = | n        Σ wt_i  |^{-1}  | Σ mpg_i        |
          | Σ wt_i   Σ wt_i² |       | Σ wt_i · mpg_i |

  (all sums run over i = 1, . . . , n)
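
A minimal R sketch (illustrative) of these 2×2 normal equations for mpg ~ wt on mtcars, compared with lm():

    wt  <- mtcars$wt
    mpg <- mtcars$mpg
    n   <- length(mpg)

    XtX <- matrix(c(n,       sum(wt),
                    sum(wt), sum(wt^2)), nrow = 2, byrow = TRUE)
    Xty <- c(sum(mpg), sum(wt * mpg))

    solve(XtX, Xty)                      # (β̂0, β̂1) from the normal equations
    coef(lm(mpg ~ wt, data = mtcars))    # the same estimates via lm()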
Quadratic Regression
mpg = β0 + β1 hp + β2 hp² + ε

[Figure: scatter plot of mpg against hp.]
Quadratic Regression

- mpg = β0 + β1 hp + β2 hp² + ε
- We write the model as a linear model

      y = Xβ + ε,

  where y = (mpg_1, mpg_2, . . . , mpg_n)^T;

      X = | 1  hp_1  hp_1² |
          | 1  hp_2  hp_2² |
          | .    .     .   |
          | 1  hp_n  hp_n² |

  β = (β0, β1, β2)^T and ε = (ε_1, ε_2, . . . , ε_n)^T

- The model is still linear in the parameters.
Linear Regression

- Normal Equations:

      β̂ = (β̂0, β̂1, β̂2)^T = (X^T X)^{-1} X^T y

        = | n        Σ hp_i   Σ hp_i² |^{-1}  | Σ mpg_i         |
          | Σ hp_i   Σ hp_i²  Σ hp_i³ |       | Σ hp_i · mpg_i  |
          | Σ hp_i²  Σ hp_i³  Σ hp_i⁴ |       | Σ hp_i² · mpg_i |

  (all sums run over i = 1, . . . , n)
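
A minimal R sketch (illustrative) of fitting the quadratic model; I() protects hp^2 inside the formula so it is treated as an extra column of the design matrix:

    quad_fit <- lm(mpg ~ hp + I(hp^2), data = mtcars)
    coef(quad_fit)

    # Equivalent "by hand": design matrix with columns 1, hp, hp^2
    X <- cbind(1, mtcars$hp, mtcars$hp^2)
    solve(t(X) %*% X, t(X) %*% mtcars$mpg)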
Regression Line
mpg = β0 + β1 hp + ε

[Figure: scatter plot of mpg against hp with the fitted regression line.]
Feature Engineering
mpg = β0 + β1 hp + β2 hp² + ε

[Figure: 3-D plot of mpg against hp and the engineered feature hp².]
Sampling distribution of β̂

- Consider the standard linear model

      y = Xβ + ε,

  where ε ∼ N(0, σ² I_n) and n > p.
- This implies y ∼ N(Xβ, σ² I_n).
- The least squares estimator of β is β̂ = (X^T X)^{-1} X^T y.
- The sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
Sampling distribution of β̂

Result If y_{p×1} ∼ N_p(µ, Σ) and c is a q × p matrix, then

      z = cy ∼ N_q(cµ, cΣc^T)

- You can use this result to argue that the sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
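
A minimal R sketch (illustrative; the model mpg ~ wt + hp is an assumption) showing that the estimated covariance of β̂, s² (X^T X)^{-1}, is exactly what vcov() returns for an lm fit:

    fit <- lm(mpg ~ wt + hp, data = mtcars)

    X  <- model.matrix(fit)                          # design matrix with intercept
    s2 <- sum(residuals(fit)^2) / df.residual(fit)   # s² = RSS / (n - p)

    s2 * solve(t(X) %*% X)   # hand-computed covariance of β̂
    vcov(fit)                # matches the hand computation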
Sampling distribution for β0 and β1
mpg = β0 + β1 wt + ε

[Figure: joint sampling distribution of (β̂0, β̂1) plotted in the (β0, β1) plane, repeated across several slides.]
Sampling distribution

- y = Xβ + ε, where ε ∼ N(0, σ² I_n)
- The OLS estimator is β̂ = (X^T X)^{-1} X^T y
- The sampling distribution of β̂ is

      β̂ ∼ N(β, σ² (X^T X)^{-1})

- The residual sum of squares is

      RSS = (y − Xβ̂)^T (y − Xβ̂)

  In addition,

      RSS ∼ σ² χ²_{n−p}
Statistical Inference for β

- For the i-th predictor,

      (β̂_i − β_i) / ( σ √[(X^T X)^{-1}]_ii ) ∼ N(0, 1)

- From the χ² distribution of RSS we have

      (n − p) s² / σ² ∼ χ²_{n−p},

  where s² = RSS / (n − p). This implies

      E( RSS / (n − p) ) = σ²,

  i.e., s² is an unbiased estimator of σ².
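
A minimal R sketch (illustrative; the model mpg ~ wt + hp is an assumption) computing s² = RSS / (n − p) and checking it against the value R reports:

    fit <- lm(mpg ~ wt + hp, data = mtcars)

    RSS <- sum(residuals(fit)^2)
    s2  <- RSS / df.residual(fit)   # divide by n - p

    s2
    sigma(fit)^2                    # the same quantity extracted from the fit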


Statistical Inference for β

- Note that in the sampling distribution of β̂, σ² is unknown.
- When we estimate σ² by its unbiased estimator s² = RSS / (n − p),

      t = (β̂_i − β_i) / ( s √[(X^T X)^{-1}]_ii ) ∼ t_{n−p},

  where s √[(X^T X)^{-1}]_ii is the standard error of β̂_i.
- To test the null hypothesis H0 : β_i = 0 (predictor X_i has no impact on the dependent variable y), we can use the statistic t.
Statistical Inference for β

- Null hypothesis H0 : β_i = 0 (predictor X_i has no impact on the dependent variable y).
- Alternative hypothesis HA : β_i ≠ 0 (predictor X_i has an impact on y).
- Under H0, the test statistic is

      t = (β̂_i − 0) / ( s √[(X^T X)^{-1}]_ii ) ∼ t_{n−p}

- At the 100 × α% level of significance, if t_observed > t_{n−p}(α) or t_observed < −t_{n−p}(α), then we reject the null hypothesis.
Statistical Inference for β

- H0 : β_i = 0 vs HA : β_i ≠ 0
- Under H0, the test statistic is

      t = (β̂_i − 0) / ( s √[(X^T X)^{-1}]_ii ) = (β̂_i − 0) / se(β̂_i) ∼ t_{n−p}

- The p-value is the probability of obtaining test results at least as extreme as the observed result, assuming that the null hypothesis is correct.
- p-value = 2 · P(t > |t_observed| | H0 is true)
- If the p-value is too small, we reject the null hypothesis.
- Otherwise, we say we fail to reject the null hypothesis.
Does wt have a statistically significant effect on mpg?

- mpg = β0 + β1 wt + ε
- H0 : β1 = 0 vs HA : β1 ≠ 0

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.285       1.878   19.858         0
     wt             -5.344       0.559   -9.559         0

- β̂1 = −5.344 and se(β̂1) = 0.559, so

      (β̂1 − 0) / se(β̂1) = (−5.344 − 0) / 0.559 = −9.559

  and the p-value < 0.01.
- Weight has a statistically significant effect on mpg.
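
A minimal R sketch (illustrative) showing how a coefficient table of this form can be reproduced, and how the t value and two-sided p-value are computed by hand:

    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)$coefficients        # Estimate, Std. Error, t value, Pr(>|t|)

    b1   <- coef(fit)["wt"]
    se1  <- summary(fit)$coefficients["wt", "Std. Error"]
    tval <- (b1 - 0) / se1                                        # about -9.559
    2 * pt(abs(tval), df = df.residual(fit), lower.tail = FALSE)  # two-sided p-value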


Does wt and/or hp have a statistically significant effect on mpg?

- mpg = β0 + β1 wt + β2 hp + ε
- H0 : β1 = 0 vs HA : β1 ≠ 0

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.227       1.599   23.285     0.000
     wt             -3.878       0.633   -6.129     0.000
     hp             -0.032       0.009   -3.519     0.001

- β̂1 = −3.878 and se(β̂1) = 0.633, and under H0,

      t-value = (β̂1 − 0) / se(β̂1) = (−3.878 − 0) / 0.633 = −6.129

  and the p-value < 0.01.
- Weight has a statistically significant effect on mpg.


Compare the two models

Model 1: mpg = β0 + β1 wt + ε

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.285       1.878   19.858         0
     wt             -5.344       0.559   -9.559         0

Model 2: mpg = β0 + β1 wt + β2 hp + ε

                  Estimate  Std. Error  t value  Pr(>|t|)
     (Intercept)    37.227       1.599   23.285     0.000
     wt             -3.878       0.633   -6.129     0.000
     hp             -0.032       0.009   -3.519     0.001

1. Model 1 is a 2D model and Model 2 is a 3D model: are they comparable?
2. The se(β̂1) in Model 2 is higher than in Model 1. Why?

- Let's discuss this issue.
Let's understand the Model Complexity
Linear Regression: mpg = β0 + β1 hp + ε

[Figure: scatter plot of mpg against hp with the fitted line.]

R-squared = 0.602, RMSE = 3.74


Let's understand the Model Complexity
Quadratic Regression: mpg = β0 + β1 hp + β2 hp² + ε

[Figure: 3-D plot of mpg against hp and hp².]
Let's understand the Model Complexity
Quadratic Regression: mpg = β0 + β1 hp + β2 hp² + ε

[Figure: scatter plot of mpg against hp with the fitted quadratic curve.]

R-squared = 0.756, RMSE = 2.93


Let's understand the Model Complexity
Polynomial Regression of order 3: mpg = β0 + β1 hp + β2 hp² + β3 hp³ + ε

[Figure: scatter plot of mpg against hp with the fitted cubic curve.]

R-squared = 0.761, RMSE = 2.903


Let's understand the Model Complexity

               R-squared   RMSE
     Model 1       0.602   3.74
     Model 2       0.756   2.93
     Model 3       0.761   2.903
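
A minimal R sketch (illustrative; RMSE here is taken as the square root of the mean squared residual, an assumption consistent with the numbers above) fitting the three polynomial models and reporting in-sample R-squared and RMSE:

    fits <- list(
      M1 = lm(mpg ~ hp, data = mtcars),
      M2 = lm(mpg ~ hp + I(hp^2), data = mtcars),
      M3 = lm(mpg ~ hp + I(hp^2) + I(hp^3), data = mtcars)
    )

    sapply(fits, function(f) c(
      R.squared = summary(f)$r.squared,
      RMSE      = sqrt(mean(residuals(f)^2))
    ))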
Model Complexity

M1 Regression line:
      mpg = β0 + β1 hp + ε

M2 Regression plane:
      mpg = β0 + β1 hp + β2 hp² + ε

M3 Regression 3-dimensional hyperplane:
      mpg = β0 + β1 hp + β2 hp² + β3 hp³ + ε

M3' Regression 3-dimensional hyperplane:
      mpg = β0 + β1 hp + β2 wt + β3 disp + ε
Quiz: Model Complexity

- Should we blindly increase the model complexity?
- We have to be careful about the bias-variance trade-off.
What is multicollinearity?

- Consider the standard linear model

      y = Xβ + ε,

  where ε ∼ N(0, σ² I_n) and n > p.
- This implies y ∼ N(Xβ, σ² I_n).
- The least squares estimator of β is β̂ = (X^T X)^{-1} X^T y.
- The sampling distribution of β̂ is

      β̂ ∼ N_p(β, σ² (X^T X)^{-1})
What is multicollinearity?

- If the correlation between two predictors in X is 1, then one column depends exactly on the other, which results in det(X^T X) = 0.
- Hence X^T X will not be invertible, because

      (X^T X)^{-1} = Adj(X^T X) / det(X^T X)

- In such a case a unique solution does not exist.
Why is multicollinearity a problem?

- Suppose the correlation between two predictors in X is nearly 1 or -1, but not exactly 1.
- For example, cor(X_i, X_j) = 0.99 - what happens then?
- det(X^T X) = δ > 0, where δ is a very small value.
- X^T X is invertible, but every element of (X^T X)^{-1} will be very large.
- A unique solution β̂ exists, but Cov(β̂) = σ² (X^T X)^{-1} will be extremely large, so the standard errors will be very large. Hence valid statistical inference cannot be carried out.
Correlated Predictors

- We consider a simple no-intercept model (see the sketch below):

      mpg = β1 wt + β2 drat + ε

- ρ(wt, drat) = −0.71
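
A minimal R sketch (illustrative) showing how the correlated predictors induce strongly correlated coefficient estimates in this no-intercept model:

    cor(mtcars$wt, mtcars$drat)        # about -0.71

    fit <- lm(mpg ~ wt + drat - 1, data = mtcars)   # "-1" drops the intercept
    V   <- vcov(fit)                                # estimated Cov(β̂) = s²(X^T X)^{-1}
    cov2cor(V)                                      # correlation between β̂1 and β̂2
                                                    # (the slides report about -0.92)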
Sampling distribution for β1 and β2
OLS Estimator induces ρ(β̂1, β̂2) = −0.92

[Figure: joint sampling distribution of (β̂1, β̂2) under the OLS estimator, plotted in the (β1, β2) plane.]
Sampling distribution for β1 and β2
Ridge Estimator induces ρ(β̂1, β̂2) = −0.73

[Figure: joint sampling distribution of (β̂1, β̂2) under the ridge estimator, plotted in the (β1, β2) plane.]
Thank You
[email protected]
