Probabilistic discriminative models
Linear models for classification
Probabilistic discriminative models
For the two-class classification problem, the posterior probability of class
C1 can be written as a logistic sigmoid acting on a linear function of x
p(C_1|x) = \sigma\left( \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)} \right) = \sigma(w^T x + w_0)
◮ for a wide choice of class-conditional distributions p(x|Ck )
For the multi-class case, the posterior probability of class Ck
is given by a softmax transformation of a linear function
of x
p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_{j=1}^{K} p(x|C_j)\,p(C_j)} = \frac{\exp(w_k^T x + w_{k0})}{\sum_{j=1}^{K} \exp(w_j^T x + w_{j0})}
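A minimal numerical sketch of this relationship (the joint values below are made up): normalising p(x|Ck)p(Ck) by Bayes' theorem gives the same posteriors as a softmax of the log-joint activations, and the two-class case reduces to a logistic sigmoid.

```python
# Softmax of the log-joint equals the Bayes posterior; for K = 2 the
# posterior of C1 is a sigmoid of a = ln[p(x|C1)p(C1)/(p(x|C2)p(C2))].
import numpy as np

joint = np.array([0.12, 0.03, 0.05])          # hypothetical p(x|Ck) p(Ck), k = 1..3
posterior_bayes = joint / joint.sum()          # Bayes' theorem

a = np.log(joint)                              # activations a_k
posterior_softmax = np.exp(a) / np.exp(a).sum()
print(np.allclose(posterior_bayes, posterior_softmax))    # True

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
joint2 = joint[:2]                             # two-class case
print(np.isclose(sigma(np.log(joint2[0] / joint2[1])),
                 joint2[0] / joint2.sum()))    # True
```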
Probabilistic discriminative models (cont.)
For specific choices of class-conditionals p(x|Ck ), maximum likelihood can be used to
determine the parameters of the densities and the class priors p(Ck )
◮ Bayes’ theorem is then used to find posterior class probabilities p(Ck |x)
An alternative approach is to use the functional form of the generalised
linear model explicitly and determine its parameters directly by maximum
likelihood
◮ There is an efficient algorithm for finding such solutions
◮ Iterative re-weighted least squares, IRLS
Probabilistic discriminative models (cont.)
The indirect approach to find parameters of a generalised linear model, by
fitting class-conditional densities and class priors separately and then by
applying Bayes’ theorem, represents an example of generative modelling
◮ We could take such a model and generate synthetic data
by drawing values of x from the marginal distribution p(x)
In the direct approach, we maximise a likelihood function defined through the
conditional distribution p(Ck|x); this is a form of discriminative training
◮ One advantage of the discriminative approach is that there
will typically be fewer adaptive parameters to be determined
◮ It may also lead to improved predictive performance, particularly
when the class-conditional density assumptions give a poor
approximation to the true distributions
Fixed basis functions
We considered classification models that work with the original input vector x
However, all of the algorithms are equally applicable if we first make a fixed
nonlinear transformation of the inputs using a vector of basis functions φ(x)
The resulting decision boundaries will be linear in the feature space φ, and
these correspond to nonlinear decision boundaries in the original x space
◮ Classes that are linearly separable in the feature space φ(x) need
not be linearly separable in the original observation space x
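A minimal sketch of such a fixed feature map, assuming two Gaussian basis functions with arbitrarily chosen centres and width: each two-dimensional input x is mapped to (φ1(x), φ2(x)), so a linear boundary in feature space corresponds to a nonlinear boundary in x.

```python
# Fixed Gaussian basis functions phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)).
import numpy as np

def gaussian_basis(X, centres, s=0.5):
    """Map inputs X of shape (N, 2) to features of shape (N, len(centres))."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

centres = np.array([[-0.5, 0.0], [0.5, 0.0]])    # hypothetical basis centres
X = np.array([[-0.6, 0.1], [0.4, -0.2], [0.0, 0.9]])
print(gaussian_basis(X, centres))                # each row is (phi1(x), phi2(x))
```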
Fixed basis functions (cont.)
[Figure: left panel shows the original input space (x1, x2); right panel shows the feature space (φ1, φ2)]
Original input space (x1 , x2 ) together with points from two classes (red/blue)
◮ Two ‘Gaussian’ basis functions φ1 (x) and φ2(x) are defined in this space
with centres (green crosses) and with contours (green circles)
Feature space (φ1, φ2) together with the linear decision boundary (black line)
◮ Nonlinear decision boundary in the original input space (black curve)
Fixed basis functions (cont.)
Often, there is significant overlap between class-conditional densities p(x|Ck )
◮ This corresponds to posterior probabilities p(Ck |x), which are not 0 or 1
◮ At least, for some values of x
In such cases, the optimal solution is obtained by modelling the posterior
probabilities p(Ck |x) accurately and then applying standard decision theory
Note that nonlinear transformations φ(x) cannot remove such class
overlap
◮ Indeed, they can increase the level of overlap, or even create
overlap where none existed in the original observation space
However, suitable choices of nonlinearity can often make
the process of modelling the posterior probabilities easier
Notwithstanding these limitations, models with fixed nonlinear basis functions play an important role in practice
Logistic regression
When considering the two-class problem using a generative approach and
under general assumptions, the posterior probability of class C1 can be written as
◮ a logistic sigmoid acting on a linear function of the feature vector φ, so that
p(C_1|\phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2|\phi) = 1 - p(C_1|\phi)    (1)

◮ The logistic sigmoid function is defined as \sigma(a) = \frac{1}{1 + \exp(-a)}, with a = \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}
In the terminology of statistics this model is known as logistic regression
◮ For an M-dimensional feature space φ, the model has M parameters
Logistic regression (cont.)
To fit Gaussian class-conditional densities with maximum likelihood, we need
◮ 2M + M(M + 1)/2 parameters for the means and the (shared) covariance matrix
◮ a total of M(M + 5)/2 + 1 parameters, if we include the class prior p(C1)
◮ The number of parameters grows quadratically with M
In contrast, the logistic regression model has only M parameters, which we determine directly by maximum likelihood
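For example, with M = 100 features the generative approach needs 100 · 105/2 + 1 = 5251 parameters, whereas logistic regression needs only 100.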
Logistic regression (cont.)
For data \{\phi_n, t_n\}_{n=1}^{N} with t_n \in \{0, 1\} and \phi_n = \phi(x_n), the likelihood function

p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}    (2)

is written for t = (t_1, \ldots, t_N)^T and y_n = p(C_1|\phi_n)
By taking the negative logarithm of the likelihood, the error function is defined as

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \big\{ t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n) \big\}    (3)

which is the cross-entropy error function, with yn = σ(an) and an = w^T φn
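A minimal sketch of this error function, assuming a small synthetic design matrix and zero initial weights (so that every yn = 0.5 and E(w) = N ln 2):

```python
# Cross-entropy error E(w) of Eq. (3) with y_n = sigma(w^T phi_n).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]."""
    y = sigmoid(Phi @ w)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))        # N = 5 points, M = 3 basis functions
t = np.array([0, 1, 1, 0, 1])
w = np.zeros(3)
print(cross_entropy(w, Phi, t))      # equals N * ln 2 when w = 0
```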
Logistic regression (cont.)
By taking the gradient of the error function with respect to w, we get

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n    (4)
The contribution to the gradient from point n comes from the error (yn − t n ) between
target value and model prediction, times the basis function vector φn
◮ The gradient takes the same form as the gradient of
the sum-of-squares error function for linear
regression models
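A short numerical check, on synthetic data, that the gradient of Eq. (4), written compactly as Φ^T(y − t), agrees with a finite-difference estimate of the cross-entropy error:

```python
# Verify grad E(w) = Phi^T (y - t) against central finite differences.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def E(w, Phi, t):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_E(w, Phi, t):
    return Phi.T @ (sigmoid(Phi @ w) - t)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 3))
t = rng.integers(0, 2, size=6).astype(float)
w = rng.normal(size=3)

num = np.array([(E(w + 1e-6 * e, Phi, t) - E(w - 1e-6 * e, Phi, t)) / 2e-6
                for e in np.eye(3)])
print(np.allclose(num, grad_E(w, Phi, t), atol=1e-5))   # True
```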
Logistic regression (cont.)
Maximum likelihood can show severe over-fitting for linearly separable
datasets
◮ The MLE solution occurs when the hyperplane for σ = 0.5, or w T φ
= 0, separates the two classes and the magnitude of w goes to
infinity
◮ The logistic sigmoid becomes infinitely steep (a Heaviside step) in feature space, and every training point from class k is assigned posterior probability p(Ck|x) = 1
There is also a continuum of such solutions because any separating
hyperplane gives rise to the same posterior probabilities at the
training data points
◮ Maximum likelihood does not favour one such solution over another
◮ The solution depends on the optimisation algorithm and initialisation
One possibility would be to introduce a prior over w and find a MAP solution
◮ Add a regularisation term to the error function
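A sketch of one such regularised error, assuming a zero-mean isotropic Gaussian prior over w (the value λ = 0.1 below is arbitrary): the error gains the penalty (λ/2)‖w‖² and its gradient the term λw, which keeps ‖w‖ finite even for linearly separable data.

```python
# Regularised cross-entropy error and gradient for logistic regression.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def reg_error_and_grad(w, Phi, t, lam=0.1):
    """E_reg(w) = E(w) + (lam/2) ||w||^2 and its gradient."""
    y = sigmoid(Phi @ w)
    E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y)) + 0.5 * lam * w @ w
    g = Phi.T @ (y - t) + lam * w
    return E, g

# Toy linearly separable data: bias column plus one feature.
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(reg_error_and_grad(np.array([0.0, 5.0]), Phi, t))
```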
Iterative reweighted least squares
In the case of the linear regression models, the maximum likelihood
solution, on the assumption of a Gaussian noise model, leads to a closed-
form solution
◮ A consequence of quadratic dependence of log likelihood function on w
For logistic regression, due to the nonlinearity of the logistic sigmoid
function
◮ There is no longer a closed-form solution
◮ Departure from quadratic is not substantial
Specifically, the error function is convex, and hence it has a unique
minimum
Furthermore, the error function can be minimised by an efficient
iterative technique based on the Newton-Raphson iterative
optimisation scheme
◮ A local quadratic approximation to the log likelihood function
Iterative reweighted least squares (cont.)
The Newton-Raphson update, for minimising a function E (w), takes the form
w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)    (5)

where H is the Hessian matrix, whose elements are the second derivatives of E(w) with respect to w
We apply the Newton-Raphson method to
1. the sum-of-squares error function (linear regression model)
E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - w^T \phi(x_n) \big)^2
2. the cross-entropy error function (logistic regression model)
E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \big\{ t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n) \big\}
Iterative reweighted least squares (cont.)
Gradient and Hessian of the sum-of-squares error function are
\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\,\phi_n = \Phi^T \Phi w - \Phi^T t    (6)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi    (7)

where \Phi is the N \times M design matrix with \phi_n^T in the n-th row
The Newton-Raphson update takes the form
w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \big( \Phi^T \Phi\, w^{(old)} - \Phi^T t \big) = (\Phi^T \Phi)^{-1} \Phi^T t    (8)

which is the classical least-squares solution
Because the error function is quadratic, the Newton-Raphson formula gives the exact solution in a single step
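A quick numerical confirmation, on random synthetic data, that a single Newton-Raphson step on the sum-of-squares error lands exactly on the least-squares solution, from any starting point:

```python
# One Newton step on the quadratic error reaches (Phi^T Phi)^{-1} Phi^T t.
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)

w_old = rng.normal(size=4)                       # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t           # Eq. (6)
H = Phi.T @ Phi                                  # Eq. (7)
w_new = w_old - np.linalg.solve(H, grad)         # Eq. (5)

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]    # direct least squares
print(np.allclose(w_new, w_ls))                  # True
```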
Iterative reweighted least squares (cont.)
Gradient and Hessian of the cross-entropy error function are
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n = \Phi^T (y - t)    (9)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\,\phi_n \phi_n^T = \Phi^T R \Phi    (10)

where R(w) is an N \times N diagonal matrix with (n, n) elements

R_{nn} = y_n (1 - y_n)    (11)

The Hessian is no longer constant; it depends on w through the weighting matrix R
Iterative reweighted least squares (cont.)
Because 0 < yn < 1, for an arbitrary vector u we have that u^T H u > 0
◮ H is positive definite
The error function is convex in w, and hence it has a unique minimum
The Newton-Raphson update formula becomes
w^{(new)} = w^{(old)} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t)
          = (\Phi^T R \Phi)^{-1} \big\{ \Phi^T R \Phi\, w^{(old)} - \Phi^T (y - t) \big\}
          = (\Phi^T R \Phi)^{-1} \Phi^T R z    (12)

where z is an N-dimensional vector with elements

z = \Phi w^{(old)} - R^{-1} (y - t)    (13)
Iterative reweighted least squares (cont.)

w^{(new)} = (\Phi^T R \Phi)^{-1} \Phi^T R z \quad \text{with} \quad z = \Phi w^{(old)} - R^{-1} (y - t)
The update is the set of normal equations for a weighted least-squares problem
Because the weighting matrix R is not constant but depends on the
parameter vector w, we must apply the normal equations iteratively
◮ each time using the new weight vector w to compute revised weights R
For this reason, the algorithm is known as iterative reweighted least squares, or IRLS
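A compact IRLS sketch for logistic regression following Eqs. (11)-(13), on synthetic overlapping data (the iteration cap and tolerance are arbitrary choices):

```python
# Iterative reweighted least squares for two-class logistic regression.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20, tol=1e-8):
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                              # Eq. (11)
        z = Phi @ w - np.linalg.solve(R, y - t)                 # Eq. (13)
        w_new = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z) # Eq. (12)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy example: one feature plus a bias column; the classes overlap, so IRLS converges.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print(irls(Phi, t))          # fitted (w0, w1)
```

In practice the weighted system is usually solved with a Cholesky factorisation rather than forming R explicitly; the sketch keeps the plain form of Eq. (12) for readability.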
Iterative reweighted least squares (cont.)
As in weighted least-squares problems, the elements of the diagonal weighting
matrix R can be interpreted as variances, because the mean and variance of t
(with t^2 = t, for t \in \{0, 1\}) in the logistic regression model are

E[t] = \sigma(x) = y    (14)
var[t] = E[t^2] - E[t]^2 = \sigma(x) - \sigma(x)^2 = y(1 - y)    (15)
We can interpret IRLS as the solution to a linearised problem in the space of a = w^T φ
The quantity zn (n-th element of z) can then be given an interpretation as an effective
target value in this space by making a local linear approximation to the
logistic sigmoid function around the current operating point w(old)
a_n(w) \simeq a_n(w^{(old)}) + \left. \frac{d a_n}{d y_n} \right|_{w^{(old)}} (t_n - y_n) = \phi_n^T w^{(old)} - \frac{y_n - t_n}{y_n (1 - y_n)} = z_n    (16)
Multiclass logistic regression
In the discussion of generative models for multiclass classification, we have
seen that for a large class of distributions, the posterior probabilities are
given by a softmax transformation of linear functions of feature variables
p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (17)

where the activations a_k are

a_k = w_k^T \phi    (18)
We used maximum likelihood to determine separately the class-conditional
densities and the class priors and then found the corresponding posterior
probabilities using Bayes’ theorem, implicitly determining parameters {w k }
Multiclass logistic regression (cont.)
We can use maximum likelihood to determine the parameters {wk} of this model directly

To do this, we need the derivatives of yk with respect to all of the activations aj

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)    (19)

where I_{kj} are the elements of the identity matrix
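A small finite-difference check, for random activations, that the softmax Jacobian matches Eq. (19):

```python
# Verify dy_k/da_j = y_k (I_kj - y_j) for the softmax function.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

a = np.random.default_rng(4).normal(size=4)
y = softmax(a)
analytic = np.diag(y) - np.outer(y, y)             # J[k, j] = y_k (I_kj - y_j)

eps = 1e-6
numeric = np.array([(softmax(a + eps * e) - softmax(a - eps * e)) / (2 * eps)
                    for e in np.eye(4)]).T          # column j holds dy/da_j
print(np.allclose(analytic, numeric, atol=1e-6))    # True
```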
Next we need to write the likelihood function using the 1-of-K coding scheme
◮ The target vector t n for feature vector φn belonging to class Ck
is a binary vector with all elements zero except for element k
The likelihood is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}    (20)

where y_{nk} = y_k(\phi_n) and T is the N \times K matrix of target variables with elements t_{nk}
Multiclass logistic regression (cont.)
Taking the negative logarithm gives
E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln(y_{nk})    (21)
This is the cross-entropy error function for the multiclass classification problem
We now take the gradient of the error function with respect to one of the parameter vectors wj

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\,\phi_n    (22)

◮ We used the result for the derivatives of the softmax function, \partial y_k / \partial a_j = y_k (I_{kj} - y_j)
◮ We also used \sum_k t_{nk} = 1
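A vectorised sketch of this gradient on synthetic 1-of-K targets: stacking the per-class gradients of Eq. (22) gives Φ^T(Y − T), where Y holds the softmax outputs ynk.

```python
# Multiclass cross-entropy gradient: column j of Phi^T (Y - T) is Eq. (22) for w_j.
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
N, M, K = 8, 3, 4
Phi = rng.normal(size=(N, M))
W = rng.normal(size=(M, K))                       # columns are the w_k
T = np.eye(K)[rng.integers(0, K, size=N)]         # 1-of-K coded targets

Y = softmax_rows(Phi @ W)                         # y_nk
grad = Phi.T @ (Y - T)                            # shape (M, K)
print(grad.shape)
```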
Multiclass logistic regression (cont.)
This is the same form of gradient as found for the sum-of-squares error function with the linear regression model and for the cross-entropy error with the logistic regression model
◮ The product of the error (ynj − tnj) times the basis function φn
The derivative of the log likelihood function for a linear regression model with
respect to the parameter vector w for a data point n took the same form
◮ The error (yn − tn) times the feature vector φn
Similarly, for the combination of logistic sigmoid activation function and
cross-entropy error function, and for the softmax activation function with the
multiclass cross-entropy error function, we again obtain this same simple
form
Multiclass logistic regression (cont.)
To find a batch algorithm, we can use the Newton-Raphson update to
obtain the corresponding IRLS algorithm for the multiclass problem
This requires evaluation of the Hessian matrix, which comprises blocks of size M \times M, where block (j, k) is given by

\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\,\phi_n \phi_n^T    (23)
As in the two-class case, the Hessian matrix for the multiclass logistic regression
model is positive definite, and the error function again has a unique minimum
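A sketch, on synthetic data, that assembles the blocks of Eq. (23) into the full KM × KM Hessian and checks that none of its eigenvalues is negative:

```python
# Build the block Hessian of the multiclass cross-entropy error and
# check that its eigenvalues are all non-negative.
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(6)
N, M, K = 8, 3, 4
Phi = rng.normal(size=(N, M))
W = rng.normal(size=(M, K))
Y = softmax_rows(Phi @ W)

H = np.zeros((K * M, K * M))
for j in range(K):
    for k in range(K):
        block = sum(Y[n, k] * ((j == k) - Y[n, j]) * np.outer(Phi[n], Phi[n])
                    for n in range(N))             # Eq. (23), block (j, k)
        H[j*M:(j+1)*M, k*M:(k+1)*M] = block

print(np.all(np.linalg.eigvalsh(H) > -1e-10))      # no negative eigenvalues
```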