
Commit df14ef9

simon-pepinamueller authored and committed
Rewrite of the documentation for LDA/QDA
1 parent 9dafcb1 commit df14ef9

doc/modules/lda_qda.rst

Lines changed: 73 additions & 25 deletions
@@ -8,58 +8,105 @@ Linear and quadratic discriminant analysis

Linear discriminant analysis (:class:`lda.LDA`) and
quadratic discriminant analysis (:class:`qda.QDA`)
-are two classic classifiers, with, as their names suggest, a linear and a
+are two basic classifiers, with, as their names suggest, a linear and a
quadratic decision surface, respectively.

These classifiers are attractive because they have closed-form solutions that
-can be easily computed, are inherently multiclass,
-and have proven to work well in practice.
-Also there are no parameters to tune for these algorithms.
+can be easily computed, are inherently multiclass, have proven to work well in practice and have
+no hyperparameters to tune.
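
As an illustrative aside (not part of this patch), a minimal fit/predict sketch with the
``sklearn.lda.LDA`` and ``sklearn.qda.QDA`` estimators referenced in this document, on a
made-up toy dataset::

    import numpy as np
    from sklearn.lda import LDA
    from sklearn.qda import QDA

    # Tiny two-class toy problem (illustrative only).
    X = np.array([[-2., -1.], [-1., -1.], [-3., -2.], [1., 1.], [2., 1.], [3., 2.]])
    y = np.array([1, 1, 1, 2, 2, 2])

    lda = LDA().fit(X, y)    # linear decision surface
    qda = QDA().fit(X, y)    # quadratic decision surface
    print(lda.predict([[-0.8, -1.]]), qda.predict([[-0.8, -1.]]))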

.. |ldaqda| image:: ../auto_examples/classification/images/plot_lda_qda_001.png
   :target: ../auto_examples/classification/plot_lda_qda.html
   :scale: 80

.. centered:: |ldaqda|

-The plot shows decision boundaries for LDA and QDA. The bottom row
-demonstrates that LDA can only learn linear boundaries, while QDA can learn
+The plot shows decision boundaries for LDA and QDA. The first row shows that,
+when the class covariances are the same, LDA and QDA yield the same result
+(up to a small difference resulting from the implementation). The bottom row demonstrates that, in general,
+LDA can only learn linear boundaries, while QDA can learn
quadratic boundaries and is therefore more flexible.

.. topic:: Examples:

    :ref:`example_classification_plot_lda_qda.py`: Comparison of LDA and QDA on synthetic data.

-
Dimensionality reduction using LDA
==================================

-:class:`lda.LDA` can be used to perform supervised dimensionality reduction by
-projecting the input data to a subspace consisting of the most
-discriminant directions.
+:class:`lda.LDA` can be used to perform supervised dimensionality reduction, by
+projecting the input data to a linear subspace consisting of the directions which maximize the
+separation between classes (in a precise sense discussed in the mathematics section below).
+The dimension of the output is necessarily less than the number of classes,
+so this is in general a rather strong dimensionality reduction, and only makes sense
+in a multiclass setting.
+
This is implemented in :func:`lda.LDA.transform`. The desired
dimensionality can be set using the ``n_components`` constructor
parameter. This parameter has no influence on :func:`lda.LDA.fit` or :func:`lda.LDA.predict`.
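
An illustrative sketch (not part of this patch) of the transform usage described above,
using the Iris data from the example referenced below::

    from sklearn.datasets import load_iris
    from sklearn.lda import LDA

    iris = load_iris()
    X, y = iris.data, iris.target        # 3 classes, 4 features

    # At most n_classes - 1 = 2 discriminative directions are available here;
    # n_components only affects transform(), not fit() or predict().
    lda = LDA(n_components=2).fit(X, y)
    X_reduced = lda.transform(X)         # shape (150, 2)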

+.. topic:: Examples:
+
+    :ref:`example_decomposition_plot_pca_vs_lda.py`: Comparison of LDA and PCA for dimensionality reduction of the Iris dataset
+
+Mathematical formulation of the LDA and QDA classifiers
+=======================================================
+
+Both LDA and QDA can be derived from simple probabilistic models
+which model the class conditional distribution of the data :math:`P(X|y=k)`
+for each class :math:`k`. Predictions can then be obtained by using Bayes' rule:
+
+.. math::
+    P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y=k)}{ \sum_{l} P(X | y=l) \cdot P(y=l)}
+
+and we select the class :math:`k` which maximizes this conditional probability.
+
+More specifically, for linear and quadratic discriminant analysis, :math:`P(X|y)`
+is modelled as a multivariate Gaussian distribution with density (here :math:`d` is the number of features):

-Mathematical Idea
-=================
+.. math:: p(X | y=k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right)

-Both methods work by modeling the class conditional distribution of the data :math:`P(X|y=k)`
-for each class :math:`k`. Predictions can be obtained by using Bayes' rule:
+To use this model as a classifier, we just need to estimate from the training data
+the class priors :math:`P(y=k)` (by the proportion of instances of class :math:`k`), the
+class means :math:`\mu_k` (by the empirical sample class means) and the covariance matrices
+(either by the empirical sample class covariance matrices, or by a regularized estimator: see the section on shrinkage below).
+
+In the case of LDA, the Gaussians for each class are assumed
+to share the same covariance matrix: :math:`\Sigma_k = \Sigma` for all :math:`k`.
+This leads to linear decision surfaces, as can be seen by comparing the log-probability ratios
+:math:`\log[P(y=k | X) / P(y=l | X)]`:

.. math::
-    P(y | X) = P(X | y) \cdot P(y) / P(X) = P(X | y) \cdot P(Y) / ( \sum_{y'} P(X | y') \cdot p(y'))
+    \log\left(\frac{P(y=k|X)}{P(y=l | X)}\right) = 0 \Leftrightarrow (\mu_k-\mu_l)^t \Sigma^{-1} X = \frac{1}{2} (\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l)
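
As an illustrative aside (not part of this patch), a small NumPy sketch of the plug-in
estimation and the resulting linear decision rule described above; the helper name is made
up for illustration, and its predictions should match the library estimator's only up to
details of the covariance estimate::

    import numpy as np

    def lda_predict(X, y, X_new):
        """Estimate priors, class means and a shared (pooled) covariance, then
        classify X_new by the largest linear discriminant score
        delta_k(x) = x^t Sigma^-1 mu_k - 1/2 mu_k^t Sigma^-1 mu_k + log P(y=k)."""
        classes = np.unique(y)
        priors = np.array([np.mean(y == k) for k in classes])
        means = np.array([X[y == k].mean(axis=0) for k in classes])
        # Pooled within-class covariance estimate, shared by all classes.
        pooled = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
                     for k in classes) / (len(y) - len(classes))
        precision = np.linalg.inv(pooled)
        scores = (np.dot(np.dot(X_new, precision), means.T)
                  - 0.5 * np.sum(np.dot(means, precision) * means, axis=1)
                  + np.log(priors))
        return classes[np.argmax(scores, axis=1)]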
+
+In the case of QDA, there are no assumptions on the covariance matrices :math:`\Sigma_k` of the Gaussians,
+leading to quadratic decision surfaces. See [#1]_ for more details.
+
+.. note:: **Relation with Gaussian Naive Bayes**
+
+   If in the QDA model one assumes that the covariance matrices are diagonal, then
+   the features are assumed to be conditionally independent in each class,
+   and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier :class:`GaussianNB`.
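
An illustrative check of the note above (not part of this patch): score each class with a
diagonal-covariance Gaussian plus a log prior and compare against ``GaussianNB``; agreement
is expected only up to small numerical differences in the variance estimates::

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.naive_bayes import GaussianNB

    iris = load_iris()
    X, y = iris.data, iris.target
    classes = np.unique(y)

    means = np.array([X[y == k].mean(axis=0) for k in classes])
    variances = np.array([X[y == k].var(axis=0) for k in classes])
    log_priors = np.log([np.mean(y == k) for k in classes])

    # log P(y=k) - 1/2 * sum_j [log var_kj + (x_j - mean_kj)^2 / var_kj]
    # (the -d/2 log(2*pi) term is the same for every class and can be dropped).
    log_post = log_priors + np.array(
        [-0.5 * np.sum(np.log(variances[i]) + (X - means[i]) ** 2 / variances[i], axis=1)
         for i in range(len(classes))]).T
    diag_qda_pred = classes[np.argmax(log_post, axis=1)]

    gnb_pred = GaussianNB().fit(X, y).predict(X)
    print(np.mean(diag_qda_pred == gnb_pred))    # expected to be (very close to) 1.0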
+
+Mathematical formulation of LDA dimensionality reduction
+===========================================================
+
+To understand the use of LDA in dimensionality reduction, it is useful to start
+with a geometric reformulation of the LDA classification rule explained above.
+We write :math:`K` for the total number of target classes.
+Since in LDA we assume that all classes have the same estimated covariance :math:`\Sigma`, we can rescale the
+data so that this covariance is the identity:

-In linear and quadratic discriminant analysis, :math:`P(X|y)`
-is modelled as a Gaussian distribution.
-In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix.
-This leads to a linear decision surface, as can be seen by comparing the the log-probability rations
-:math:`log[P(y=k | X) / P(y=l | X)]`.
+.. math:: X^* = D^{-1/2}U^t X\text{ with }\Sigma = UDU^t

-In the case of QDA, there are no assumptions on the covariance matrices of the Gaussians,
-leading to a quadratic decision surface.
+Then one can show that classifying a data point after scaling is equivalent to finding the estimated class mean :math:`\mu^*_k` which is
+closest to the data point in Euclidean distance. But this can be done just as well after projecting onto the :math:`K-1` dimensional affine subspace :math:`H_K`
+generated by the :math:`\mu^*_k` for all classes. This shows that, implicit in the LDA classifier, there is
+a dimensionality reduction by linear projection onto a :math:`K-1` dimensional space.

+We can reduce the dimension even more, to a chosen :math:`L`, by projecting onto the linear subspace :math:`H_L` which
+maximizes the variance of the :math:`\mu^*_k` after projection (in effect, we are doing a form of PCA for the transformed class means :math:`\mu^*_k`).
+This :math:`L` corresponds to the ``n_components`` parameter in the :func:`lda.LDA.transform` method. See [#1]_ for more details.
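
An illustrative NumPy sketch of the rescaling and nearest-class-mean view above (not part of
this patch), using the eigendecomposition :math:`\Sigma = U D U^t` of a pooled covariance
estimate on the Iris data::

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target
    classes = np.unique(y)

    # Pooled within-class covariance: the shared Sigma assumed by LDA.
    pooled = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
                 for k in classes) / (len(y) - len(classes))

    # Rescale so that this covariance becomes the identity: X* = D^{-1/2} U^t X.
    D, U = np.linalg.eigh(pooled)                    # Sigma = U diag(D) U^t
    X_star = np.dot(X, U / np.sqrt(D))               # rows are (D^{-1/2} U^t x)^t
    means_star = np.array([X_star[y == k].mean(axis=0) for k in classes])

    # Classifying after rescaling = picking the closest transformed class mean.
    dists = ((X_star[:, np.newaxis, :] - means_star[np.newaxis, :, :]) ** 2).sum(axis=2)
    pred = classes[np.argmin(dists, axis=1)]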

Shrinkage
=========
@@ -70,7 +117,7 @@ features. In this scenario, the empirical sample covariance is a poor
estimator. Shrinkage LDA can be used by setting the ``shrinkage`` parameter of
the :class:`lda.LDA` class to 'auto'. This automatically determines the
optimal shrinkage parameter in an analytic way following the lemma introduced
-by Ledoit and Wolf. Note that currently shrinkage only works when setting the
+by Ledoit and Wolf [#2]_. Note that currently shrinkage only works when setting the
``solver`` parameter to 'lsqr' or 'eigen'.
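
An illustrative sketch of the shrinkage usage described above (not part of this patch), on a
made-up dataset with many features and few samples, where the empirical covariance is a poor
estimator::

    import numpy as np
    from sklearn.lda import LDA

    rng = np.random.RandomState(0)
    X = rng.randn(20, 50)              # 20 samples, 50 features
    y = np.repeat([0, 1], 10)

    # Ledoit-Wolf shrinkage of the covariance estimate; requires the
    # 'lsqr' or 'eigen' solver, as noted above.
    clf = LDA(solver='lsqr', shrinkage='auto').fit(X, y)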

The ``shrinkage`` parameter can also be manually set between 0 and 1. In
@@ -111,7 +158,8 @@ a high number of features.

.. topic:: References:

-    Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer, 2009.
+    .. [#1] "The Elements of Statistical Learning", Hastie T., Tibshirani R.,
+        Friedman J., Section 4.3, p.106-119, 2008.

-    Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio
+    .. [#2] Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio
        Management 30(4), 110-119, 2004.
