Probabilistic discriminative models
Linear models for classification
Probabilistic discriminative models
For the two-class classification problem, the posterior probability of class
C1 can be written as a logistic sigmoid acting on a linear function of x
p(C_1|x) = \sigma\left( \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)} \right) = \sigma(w^T x + w_0)
◮ for a wide choice of class-conditional distributions p(x|Ck )
For the multi-class case, the posterior probability of class Ck
is given by a softmax transformation of a linear function
of x
p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_{j=1}^{K} p(x|C_j)\,p(C_j)} = \frac{\exp(w_k^T x + w_{k0})}{\sum_{j=1}^{K} \exp(w_j^T x + w_{j0})}
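A minimal numerical sketch of this relationship (the joint values below are made up): normalising p(x|Ck)p(Ck) by Bayes' theorem gives the same posteriors as a softmax of the log-joint activations, and the two-class case reduces to a logistic sigmoid.

```python
# Softmax of the log-joint equals the Bayes posterior; for K = 2 the
# posterior of C1 is a sigmoid of a = ln[p(x|C1)p(C1)/(p(x|C2)p(C2))].
import numpy as np

joint = np.array([0.12, 0.03, 0.05])          # hypothetical p(x|Ck) p(Ck), k = 1..3
posterior_bayes = joint / joint.sum()          # Bayes' theorem

a = np.log(joint)                              # activations a_k
posterior_softmax = np.exp(a) / np.exp(a).sum()
print(np.allclose(posterior_bayes, posterior_softmax))    # True

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
joint2 = joint[:2]                             # two-class case
print(np.isclose(sigma(np.log(joint2[0] / joint2[1])),
                 joint2[0] / joint2.sum()))    # True
```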
Probabilistic discriminative models (cont.)
For specific choices of class-conditionals p(x|Ck ), maximum likelihood can be used to
determine the parameters of the densities and the class priors p(Ck )
◮ Bayes’ theorem is then used to find posterior class probabilities p(Ck |x)
An alternative approach is to use the functional form of the generalised
linear model explicitly and determine its parameters directly by maximum
likelihood
◮ There is an efficient algorithm for finding such solutions
◮ Iterative re-weighted least squares, IRLS
Probabilistic discriminative models (cont.)
The indirect approach to find parameters of a generalised linear model, by
fitting class-conditional densities and class priors separately and then by
applying Bayes’ theorem, represents an example of generative modelling
◮ We could take such a model and generate synthetic data
by drawing values of x from the marginal distribution p(x)
In the direct approach, we maximise a likelihood function defined through the
conditional distribution p(Ck|x); this is a form of discriminative training
◮ One advantage of the discriminative approach is that there
will typically be fewer adaptive parameters to be determined
◮ It may also lead to improved predictive performance, particularly
when the class-conditional density assumptions give a poor
approximation to the true distributions
Fixed basis functions
We considered classification models that work with the original input vector x
However, all of the algorithms are equally applicable if we first make a fixed
nonlinear transformation of the inputs using a vector of basis functions φ(x)
The resulting decision boundaries will be linear in the feature space φ, and
these correspond to nonlinear decision boundaries in the original x space
◮ Classes that are linearly separable in the feature space φ(x) need
not be linearly separable in the original observation space x
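A minimal sketch of such a fixed feature map, assuming two Gaussian basis functions with arbitrarily chosen centres and width: each two-dimensional input x is mapped to (φ1(x), φ2(x)), so a linear boundary in feature space corresponds to a nonlinear boundary in x.

```python
# Fixed Gaussian basis functions phi_j(x) = exp(-||x - mu_j||^2 / (2 s^2)).
import numpy as np

def gaussian_basis(X, centres, s=0.5):
    """Map inputs X of shape (N, 2) to features of shape (N, len(centres))."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

centres = np.array([[-0.5, 0.0], [0.5, 0.0]])    # hypothetical basis centres
X = np.array([[-0.6, 0.1], [0.4, -0.2], [0.0, 0.9]])
print(gaussian_basis(X, centres))                # each row is (phi1(x), phi2(x))
```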
Fixed basis functions (cont.)
[Figure: left panel shows the original input space (x1, x2); right panel shows the feature space (φ1, φ2)]
Original input space (x1 , x2 ) together with points from two classes (red/blue)
◮ Two ‘Gaussian’ basis functions φ1 (x) and φ2(x) are defined in this space
with centres (green crosses) and with contours (green circles)
Feature space (φ1, φ2) together with the linear decision boundary (black line)
◮ Nonlinear decision boundary in the original input space (black curve)
Fixed basis functions (cont.)
Often, there is significant overlap between class-conditional densities p(x|Ck )
◮ This corresponds to posterior probabilities p(Ck |x), which are not 0 or 1
◮ At least, for some values of x
In such cases, the optimal solution is obtained by modelling the posterior
probabilities p(Ck |x) accurately and then applying standard decision theory
Note that nonlinear transformations φ(x) cannot remove such class
overlap
◮ Indeed, they can increase the level of overlap, or even create
overlap where none existed in the original observation space
However, suitable choices of nonlinearity can often make
the process of modelling the posterior probabilities easier
Notwithstanding these limitations, models with fixed nonlinear basis functions play an important role in practice
Logistic regression
When considering the two-class problem using a generative approach and
under general assumptions, the posterior probability of class C1 can be written as
◮ a logistic sigmoid acting on a linear function of the feature vector φ, so that
p(C_1|\phi) = y(\phi) = \sigma(w^T \phi), \qquad p(C_2|\phi) = 1 - p(C_1|\phi)    (1)

◮ The logistic sigmoid function is defined as \sigma(a) = \frac{1}{1 + \exp(-a)}, with a = \ln \frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}
In the terminology of statistics this model is known as logistic regression
◮ For an M-dimensional feature space φ, the model has M parameters
Logistic regression (cont.)
To fit Gaussian class-conditional densities with maximum likelihood, we need
◮ 2M + M(M + 1)/2 parameters for the means and the (shared) covariance matrix
◮ a total of M(M + 5)/2 + 1 parameters, if we include the class prior p(C1)
◮ The number of parameters grows quadratically with M
In contrast, the logistic regression model has only M parameters, which we determine directly by maximum likelihood
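For example, with M = 100 features the generative approach needs 100 · 105/2 + 1 = 5251 parameters, whereas logistic regression needs only 100.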
Logistic regression (cont.)
For data \{\phi_n, t_n\}_{n=1}^{N} with t_n \in \{0, 1\} and \phi_n = \phi(x_n), the likelihood function

p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}    (2)

is written for t = (t_1, \ldots, t_N)^T and y_n = p(C_1|\phi_n)
By taking the negative logarithm of the likelihood, the error function is defined as

E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \big\{ t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n) \big\}    (3)

which is the cross-entropy error function, with yn = σ(an) and an = w^T φn
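A minimal sketch of this error function, assuming a small synthetic design matrix and zero initial weights (so that every yn = 0.5 and E(w) = N ln 2):

```python
# Cross-entropy error E(w) of Eq. (3) with y_n = sigma(w^T phi_n).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]."""
    y = sigmoid(Phi @ w)
    eps = 1e-12                      # guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))        # N = 5 points, M = 3 basis functions
t = np.array([0, 1, 1, 0, 1])
w = np.zeros(3)
print(cross_entropy(w, Phi, t))      # equals N * ln 2 when w = 0
```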
Logistic regression (cont.)
By taking the gradient of the error function with respect to w, we get

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n    (4)
The contribution to the gradient from point n comes from the error (yn − t n ) between
target value and model prediction, times the basis function vector φn
◮ The gradient takes the same form as the gradient of
the sum-of-squares error function for linear
regression models
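A short numerical check, on synthetic data, that the gradient of Eq. (4), written compactly as Φ^T(y − t), agrees with a finite-difference estimate of the cross-entropy error:

```python
# Verify grad E(w) = Phi^T (y - t) against central finite differences.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def E(w, Phi, t):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_E(w, Phi, t):
    return Phi.T @ (sigmoid(Phi @ w) - t)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 3))
t = rng.integers(0, 2, size=6).astype(float)
w = rng.normal(size=3)

num = np.array([(E(w + 1e-6 * e, Phi, t) - E(w - 1e-6 * e, Phi, t)) / 2e-6
                for e in np.eye(3)])
print(np.allclose(num, grad_E(w, Phi, t), atol=1e-5))   # True
```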
Logistic regression (cont.)
Maximum likelihood can show severe over-fitting for linearly separable
datasets
◮ The MLE solution occurs when the hyperplane for σ = 0.5, or w T φ
= 0, separates the two classes and the magnitude of w goes to
infinity
◮ The logistic sigmoid becomes infinitely steep (a Heaviside step) in feature space, and every training point from class k is assigned posterior probability p(Ck|x) = 1
There is also a continuum of such solutions because any separating
hyperplane gives rise to the same posterior probabilities at the
training data points
◮ Maximum likelihood does not favour one such solution over another
◮ The solution depends on the optimisation algorithm and initialisation
One possibility would be to introduce a prior over w and find a MAP solution
◮ Add a regularisation term to the error function
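A sketch of one such regularised error, assuming a zero-mean isotropic Gaussian prior over w (the value λ = 0.1 below is arbitrary): the error gains the penalty (λ/2)‖w‖² and its gradient the term λw, which keeps ‖w‖ finite even for linearly separable data.

```python
# Regularised cross-entropy error and gradient for logistic regression.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def reg_error_and_grad(w, Phi, t, lam=0.1):
    """E_reg(w) = E(w) + (lam/2) ||w||^2 and its gradient."""
    y = sigmoid(Phi @ w)
    E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y)) + 0.5 * lam * w @ w
    g = Phi.T @ (y - t) + lam * w
    return E, g

# Toy linearly separable data: bias column plus one feature.
Phi = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
print(reg_error_and_grad(np.array([0.0, 5.0]), Phi, t))
```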
Iterative reweighted least squares
In the case of the linear regression models, the maximum likelihood
solution, on the assumption of a Gaussian noise model, leads to a closed-
form solution
◮ A consequence of quadratic dependence of log likelihood function on w
For logistic regression, due to the nonlinearity of the logistic sigmoid
function
◮ There is no longer a closed-form solution
◮ Departure from quadratic is not substantial
Specifically, the error function is convex, and hence it has a unique
minimum
Furthermore, the error function can be minimised by an efficient
iterative technique based on the Newton-Raphson iterative
optimisation scheme
◮ A local quadratic approximation to the log likelihood function
Iterative reweighted least squares (cont.)
The Newton-Raphson update, for minimising a function E (w), takes the form
w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)    (5)

where H is the Hessian matrix, whose elements are the second derivatives of E(w) with respect to w
We apply the Newton-Raphson method to
1. the sum-of-squares error function (linear regression model)
E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - w^T \phi(x_n) \big)^2
2. the cross-entropy error function (logistic regression model)
E(w) = -\ln p(t|w) = -\sum_{n=1}^{N} \big\{ t_n \ln(y_n) + (1 - t_n) \ln(1 - y_n) \big\}
Iterative reweighted least squares (cont.)
Gradient and Hessian of the sum-of-squares error function are
\nabla E(w) = \sum_{n=1}^{N} (w^T \phi_n - t_n)\,\phi_n = \Phi^T \Phi w - \Phi^T t    (6)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} \phi_n \phi_n^T = \Phi^T \Phi    (7)

where \Phi is the N \times M design matrix with \phi_n^T in the n-th row
The Newton-Raphson update takes the form
w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \big( \Phi^T \Phi\, w^{(old)} - \Phi^T t \big) = (\Phi^T \Phi)^{-1} \Phi^T t    (8)

which is the classical least-squares solution
Because the error function is quadratic, the Newton-Raphson formula gives the exact solution in a single step
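A quick numerical confirmation, on random synthetic data, that a single Newton-Raphson step on the sum-of-squares error lands exactly on the least-squares solution, from any starting point:

```python
# One Newton step on the quadratic error reaches (Phi^T Phi)^{-1} Phi^T t.
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 4))
t = rng.normal(size=20)

w_old = rng.normal(size=4)                       # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t           # Eq. (6)
H = Phi.T @ Phi                                  # Eq. (7)
w_new = w_old - np.linalg.solve(H, grad)         # Eq. (5)

w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]    # direct least squares
print(np.allclose(w_new, w_ls))                  # True
```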
Iterative reweighted least squares (cont.)
Gradient and Hessian of the cross-entropy error function are
\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n = \Phi^T (y - t)    (9)

H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\,\phi_n \phi_n^T = \Phi^T R \Phi    (10)

where R(w) is an N \times N diagonal matrix with (n, n) elements

R_{nn} = y_n (1 - y_n)    (11)

The Hessian is no longer constant; it depends on w through the weighting matrix R
Iterative reweighted least squares (cont.)
Because 0 < yn < 1, for an arbitrary vector u we have that u^T H u > 0
◮ H is positive definite
The error function is convex in w, and hence it has a unique minimum
The Newton-Raphson update formula becomes
w^{(new)} = w^{(old)} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t)
          = (\Phi^T R \Phi)^{-1} \big\{ \Phi^T R \Phi\, w^{(old)} - \Phi^T (y - t) \big\}
          = (\Phi^T R \Phi)^{-1} \Phi^T R z    (12)

where z is an N-dimensional vector with elements

z = \Phi w^{(old)} - R^{-1} (y - t)    (13)
Iterative reweighted least squares (cont.)

w^{(new)} = (\Phi^T R \Phi)^{-1} \Phi^T R z \quad \text{with} \quad z = \Phi w^{(old)} - R^{-1} (y - t)
The update is the set of normal equations for a weighted least-squares problem
Because the weighting matrix R is not constant but depends on the
parameter vector w, we must apply the normal equations iteratively
◮ each time using the new weight vector w to compute revised weights R
For this reason, the algorithm is known as iterative reweighted least squares, or IRLS
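A compact IRLS sketch for logistic regression following Eqs. (11)-(13), on synthetic overlapping data (the iteration cap and tolerance are arbitrary choices):

```python
# Iterative reweighted least squares for two-class logistic regression.
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20, tol=1e-8):
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                              # Eq. (11)
        z = Phi @ w - np.linalg.solve(R, y - t)                 # Eq. (13)
        w_new = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z) # Eq. (12)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy example: one feature plus a bias column; the classes overlap, so IRLS converges.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
t = np.concatenate([np.zeros(50), np.ones(50)])
Phi = np.column_stack([np.ones_like(x), x])
print(irls(Phi, t))          # fitted (w0, w1)
```

In practice the weighted system is usually solved with a Cholesky factorisation rather than forming R explicitly; the sketch keeps the plain form of Eq. (12) for readability.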
Iterative reweighted least squares (cont.)
As in weighted least-squares problems, the elements of the diagonal weighting
matrix R can be interpreted as variances, because the mean and variance of t
(with t^2 = t, for t \in \{0, 1\}) in the logistic regression model are

E[t] = \sigma(x) = y    (14)
var[t] = E[t^2] - E[t]^2 = \sigma(x) - \sigma(x)^2 = y(1 - y)    (15)
We can interpret IRLS as the solution to a linearised problem in the space of a = w^T φ
The quantity zn (n-th element of z) can then be given an interpretation as an effective
target value in this space by making a local linear approximation to the
logistic sigmoid function around the current operating point w(old)
a_n(w) \simeq a_n(w^{(old)}) + \left. \frac{d a_n}{d y_n} \right|_{w^{(old)}} (t_n - y_n) = \phi_n^T w^{(old)} - \frac{y_n - t_n}{y_n (1 - y_n)} = z_n    (16)
Multiclass logistic regression
In the discussion of generative models for multiclass classification, we have
seen that for a large class of distributions, the posterior probabilities are
given by a softmax transformation of linear functions of feature variables
p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (17)

where the activations a_k are

a_k = w_k^T \phi    (18)
We used maximum likelihood to determine separately the class-conditional
densities and the class priors and then found the corresponding posterior
probabilities using Bayes’ theorem, implicitly determining parameters {w k }
Multiclass logistic regression (cont.)
We can use maximum likelihood to determine the parameters {wk} of this model directly

To do this, we need the derivatives of yk with respect to all of the activations aj

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)    (19)

where I_{kj} are the elements of the identity matrix
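A small finite-difference check, for random activations, that the softmax Jacobian matches Eq. (19):

```python
# Verify dy_k/da_j = y_k (I_kj - y_j) for the softmax function.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

a = np.random.default_rng(4).normal(size=4)
y = softmax(a)
analytic = np.diag(y) - np.outer(y, y)             # J[k, j] = y_k (I_kj - y_j)

eps = 1e-6
numeric = np.array([(softmax(a + eps * e) - softmax(a - eps * e)) / (2 * eps)
                    for e in np.eye(4)]).T          # column j holds dy/da_j
print(np.allclose(analytic, numeric, atol=1e-6))    # True
```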
Next we need to write the likelihood function using the 1-of-K coding scheme
◮ The target vector t n for feature vector φn belonging to class Ck
is a binary vector with all elements zero except for element k
The likelihood is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}    (20)

where y_{nk} = y_k(\phi_n) and T is the N \times K matrix of target variables with elements t_{nk}
Multiclass logistic regression (cont.)
Taking the negative logarithm gives
E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln(y_{nk})    (21)
This is the cross-entropy error function for the multiclass classification problem
We now take the gradient of the error function with respect to one of the parameter vectors wj

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\,\phi_n    (22)

◮ We used the result for the derivatives of the softmax function, \partial y_k / \partial a_j = y_k (I_{kj} - y_j)
◮ We also used \sum_k t_{nk} = 1
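A vectorised sketch of this gradient on synthetic 1-of-K targets: stacking the per-class gradients of Eq. (22) gives Φ^T(Y − T), where Y holds the softmax outputs ynk.

```python
# Multiclass cross-entropy gradient: column j of Phi^T (Y - T) is Eq. (22) for w_j.
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
N, M, K = 8, 3, 4
Phi = rng.normal(size=(N, M))
W = rng.normal(size=(M, K))                       # columns are the w_k
T = np.eye(K)[rng.integers(0, K, size=N)]         # 1-of-K coded targets

Y = softmax_rows(Phi @ W)                         # y_nk
grad = Phi.T @ (Y - T)                            # shape (M, K)
print(grad.shape)
```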
Multiclass logistic regression (cont.)
This is the same form of gradient as found for the sum-of-squares error function with the linear regression model and for the cross-entropy error with the logistic regression model
◮ The product of the error (ynj − tnj) times the basis function φn
The derivative of the log likelihood function for a linear regression model with
respect to the parameter vector w for a data point n took the same form
◮ The error (yn − tn) times the feature vector φn
Similarly, for the combination of logistic sigmoid activation function and
cross-entropy error function, and for the softmax activation function with the
multiclass cross-entropy error function, we again obtain this same simple
form
Multiclass logistic regression (cont.)
To find a batch algorithm, we can use the Newton-Raphson update to
obtain the corresponding IRLS algorithm for the multiclass problem
This requires evaluation of the Hessian matrix, which comprises blocks of size M \times M, where block (j, k) is given by

\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\,\phi_n \phi_n^T    (23)
As in the two-class case, the Hessian matrix for the multiclass logistic regression
model is positive definite, and the error function again has a unique minimum
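A sketch, on synthetic data, that assembles the blocks of Eq. (23) into the full KM × KM Hessian and checks that none of its eigenvalues is negative:

```python
# Build the block Hessian of the multiclass cross-entropy error and
# check that its eigenvalues are all non-negative.
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(6)
N, M, K = 8, 3, 4
Phi = rng.normal(size=(N, M))
W = rng.normal(size=(M, K))
Y = softmax_rows(Phi @ W)

H = np.zeros((K * M, K * M))
for j in range(K):
    for k in range(K):
        block = sum(Y[n, k] * ((j == k) - Y[n, j]) * np.outer(Phi[n], Phi[n])
                    for n in range(N))             # Eq. (23), block (j, k)
        H[j*M:(j+1)*M, k*M:(k+1)*M] = block

print(np.all(np.linalg.eigvalsh(H) > -1e-10))      # no negative eigenvalues
```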