@@ -8,58 +8,105 @@ Linear and quadratic discriminant analysis

Linear discriminant analysis (:class:`lda.LDA`) and
quadratic discriminant analysis (:class:`qda.QDA`)
-are two classic classifiers, with, as their names suggest, a linear and a
+are two standard classifiers, with, as their names suggest, a linear and a
quadratic decision surface, respectively.

These classifiers are attractive because they have closed-form solutions that
-can be easily computed, are inherently multiclass,
-and have proven to work well in practice.
-Also there are no parameters to tune for these algorithms.
+can be easily computed, are inherently multiclass, have proven to work well in
+practice, and have no hyperparameters to tune.

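+As a minimal usage sketch (assuming the ``sklearn.lda`` and ``sklearn.qda``
+module paths behind the class references above), both classifiers follow the
+usual ``fit``/``predict`` estimator API::
+
+    >>> import numpy as np
+    >>> from sklearn.lda import LDA
+    >>> from sklearn.qda import QDA
+    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
+    >>> y = np.array([1, 1, 1, 2, 2, 2])
+    >>> print(LDA().fit(X, y).predict([[-0.8, -1]]))
+    [1]
+    >>> print(QDA().fit(X, y).predict([[-0.8, -1]]))
+    [1]
+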
.. |ldaqda| image:: ../auto_examples/classification/images/plot_lda_qda_001.png
    :target: ../auto_examples/classification/plot_lda_qda.html
    :scale: 80

.. centered:: |ldaqda|

-The plot shows decision boundaries for LDA and QDA. The bottom row
-demonstrates that LDA can only learn linear boundaries, while QDA can learn
+The plot shows decision boundaries for LDA and QDA. The first row shows that,
+when the class covariances are the same, LDA and QDA yield the same result
+(up to a small difference resulting from the implementation). The bottom row
+demonstrates that, in general, LDA can only learn linear boundaries, while
+QDA can learn
quadratic boundaries and is therefore more flexible.

.. topic:: Examples:

    :ref:`example_classification_plot_lda_qda.py`: Comparison of LDA and QDA on synthetic data.

-
Dimensionality reduction using LDA
==================================

-:class:`lda.LDA` can be used to perform supervised dimensionality reduction by
-projecting the input data to a subspace consisting of the most
-discriminant directions.
+:class:`lda.LDA` can be used to perform supervised dimensionality reduction, by
+projecting the input data to a linear subspace consisting of the directions
+which maximize the separation between classes (in a precise sense discussed in
+the mathematics section below). The dimension of the output is necessarily
+less than the number of classes, so this is in general a rather strong
+dimensionality reduction, and it only makes sense in a multiclass setting.
+
This is implemented in :func:`lda.LDA.transform`. The desired
dimensionality can be set using the ``n_components`` constructor
parameter. This parameter has no influence on :func:`lda.LDA.fit` or :func:`lda.LDA.predict`.

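+As a small sketch (assuming the ``sklearn.lda`` module path used above and the
+Iris data bundled with scikit-learn, which has three classes), the reduced
+representation is obtained with::
+
+    >>> from sklearn.datasets import load_iris
+    >>> from sklearn.lda import LDA
+    >>> iris = load_iris()
+    >>> # three classes, so at most two discriminant directions are kept
+    >>> lda = LDA(n_components=2).fit(iris.data, iris.target)
+    >>> lda.transform(iris.data).shape
+    (150, 2)
+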
+.. topic:: Examples:
+
+    :ref:`example_decomposition_plot_pca_vs_lda.py`: Comparison of LDA and PCA
+    for dimensionality reduction of the Iris dataset
+
+Mathematical formulation of the LDA and QDA classifiers
+=======================================================
+
+Both LDA and QDA can be derived from simple probabilistic models
+which model the class conditional distribution of the data :math:`P(X|y=k)`
+for each class :math:`k`. Predictions can then be obtained by using Bayes' rule:
+
+.. math::
+    P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y=k)}{\sum_{l} P(X | y=l) \cdot P(y=l)}
+
+and we select the class :math:`k` which maximizes this conditional probability.
+
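+Equivalently, since the denominator :math:`P(X)` does not depend on :math:`k`,
+this selection can be written as an argmax over the numerators alone:
+
+.. math::
+    \hat{y} = \arg\max_k P(y=k | X) = \arg\max_k P(X | y=k) P(y=k)
+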
+More specifically, for linear and quadratic discriminant analysis,
+:math:`P(X|y)` is modelled as a multivariate Gaussian distribution with density:

-Mathematical Idea
-=================
+.. math:: p(X | y=k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right)

-Both methods work by modeling the class conditional distribution of the data :math:`P(X|y=k)`
-for each class :math:`k`. Predictions can be obtained by using Bayes' rule:
+To use this model as a classifier, we just need to estimate from the training
+data the class priors :math:`P(y=k)` (by the proportion of instances of class
+:math:`k`), the class means :math:`\mu_k` (by the empirical sample class means)
+and the covariance matrices (either by the empirical sample class covariance
+matrices, or by a regularized estimator: see the section on shrinkage below).
+
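+These estimates are exposed on the fitted estimator. A small sketch (assuming
+the ``priors_`` and ``means_`` attributes of :class:`lda.LDA` and the Iris
+data)::
+
+    >>> from sklearn.datasets import load_iris
+    >>> from sklearn.lda import LDA
+    >>> iris = load_iris()
+    >>> clf = LDA().fit(iris.data, iris.target)
+    >>> clf.priors_.shape   # estimated class priors P(y=k)
+    (3,)
+    >>> clf.means_.shape    # estimated class means mu_k, one row per class
+    (3, 4)
+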
+In the case of LDA, the Gaussians for each class are assumed to share the same
+covariance matrix: :math:`\Sigma_k = \Sigma` for all :math:`k`. This leads to
+linear decision surfaces between classes, as can be seen by comparing the
+log-probability ratios :math:`\log[P(y=k | X) / P(y=l | X)]`:

.. math::
-    P(y | X) = P(X | y) \cdot P(y) / P(X) = P(X | y) \cdot P(Y) / ( \sum_{y'} P(X | y') \cdot p(y'))
+    \log\left(\frac{P(y=k|X)}{P(y=l | X)}\right) = 0 \Leftrightarrow (\mu_k-\mu_l)^t \Sigma^{-1} X = \frac{1}{2} (\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l)
+
+In the case of QDA, there are no assumptions on the covariance matrices
+:math:`\Sigma_k` of the Gaussians, leading to quadratic decision surfaces.
+See [#1]_ for more details.
+
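+The following is a minimal NumPy sketch of this decision rule (an illustration
+only, not the actual :class:`lda.LDA`/:class:`qda.QDA` implementation): given
+estimated class means, covariance matrices and priors, it scores each class by
+:math:`\log P(X|y=k) + \log P(y=k)` and picks the largest::
+
+    import numpy as np
+
+    def gaussian_log_density(X, mean, cov):
+        # log of the multivariate Gaussian density modelling P(X | y=k)
+        d = mean.shape[0]
+        diff = X - mean
+        cov_inv = np.linalg.inv(cov)
+        _, logdet = np.linalg.slogdet(cov)
+        quad = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
+        return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
+
+    def discriminant_predict(X, means, covs, priors):
+        # Bayes' rule: pick the class k maximizing log P(X|y=k) + log P(y=k).
+        # With one covariance per class this is the QDA rule; with a single
+        # shared covariance matrix it reduces to the LDA rule.
+        scores = np.column_stack([
+            gaussian_log_density(X, m, c) + np.log(p)
+            for m, c, p in zip(means, covs, priors)])
+        return scores.argmax(axis=1)
+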
+.. note:: **Relation with Gaussian Naive Bayes**
+
+    If in the QDA model one assumes that the covariance matrices are diagonal,
+    then the inputs are assumed to be conditionally independent in each class,
+    and the resulting classifier is equivalent to the Gaussian Naive Bayes
+    classifier :class:`GaussianNB`.
+
+Mathematical formulation of LDA dimensionality reduction
+========================================================
+
+To understand the use of LDA in dimensionality reduction, it is useful to start
+with a geometric reformulation of the LDA classification rule explained above.
+We write :math:`K` for the total number of target classes. Since in LDA we
+assume that all classes have the same estimated covariance :math:`\Sigma`, we
+can rescale the data so that this covariance is the identity:

-In linear and quadratic discriminant analysis, :math:`P(X|y)`
-is modelled as a Gaussian distribution.
-In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix.
-This leads to a linear decision surface, as can be seen by comparing the the log-probability rations
-:math:`log[P(y=k | X) / P(y=l | X)]`.
+.. math:: X^* = D^{-1/2}U^t X\text{ with }\Sigma = UDU^t

-In the case of QDA, there are no assumptions on the covariance matrices of the Gaussians,
-leading to a quadratic decision surface.
+Then one can show that classifying a data point after scaling is equivalent to
+finding the estimated class mean :math:`\mu^*_k` which is closest to the data
+point in the Euclidean distance. But this can be done just as well after
+projecting on the :math:`K-1` dimensional affine subspace :math:`H_K` generated
+by all the :math:`\mu^*_k`. This shows that, implicit in the LDA classifier,
+there is a dimensionality reduction by linear projection onto a :math:`K-1`
+dimensional space.

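+A minimal NumPy sketch of this rescaling and nearest-mean rule (an illustration
+under the equal-priors assumption made implicitly above, not the actual
+:class:`lda.LDA` implementation; ``Sigma`` and ``means`` stand for the shared
+covariance estimate and the estimated class means)::
+
+    import numpy as np
+
+    def whiten(Z, Sigma):
+        # X* = D^{-1/2} U^t X  with  Sigma = U D U^t (rows of Z are samples)
+        eigvals, U = np.linalg.eigh(Sigma)
+        return Z.dot(U) / np.sqrt(eigvals)
+
+    def nearest_mean_predict(X, Sigma, means):
+        # after rescaling, assign each sample to the closest transformed
+        # class mean mu*_k in Euclidean distance
+        Xs, Ms = whiten(X, Sigma), whiten(means, Sigma)
+        d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=-1)
+        return d2.argmin(axis=1)
+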
+We can reduce the dimension even further, to a chosen :math:`L`, by projecting
+onto the linear subspace :math:`H_L` which maximizes the variance of the
+:math:`\mu^*_k` after projection (in effect, we are doing a form of PCA on the
+transformed class means :math:`\mu^*_k`). This :math:`L` corresponds to the
+``n_components`` parameter used in the :func:`lda.LDA.transform` method. See
+[#1]_ for more details.

Shrinkage
=========
@@ -70,7 +117,7 @@ features. In this scenario, the empirical sample covariance is a poor
estimator. Shrinkage LDA can be used by setting the ``shrinkage`` parameter of
the :class:`lda.LDA` class to 'auto'. This automatically determines the
optimal shrinkage parameter in an analytic way following the lemma introduced
-by Ledoit and Wolf. Note that currently shrinkage only works when setting the
+by Ledoit and Wolf [#2]_. Note that currently shrinkage only works when setting the
``solver`` parameter to 'lsqr' or 'eigen'.

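+A minimal sketch (assuming the ``sklearn.lda`` module path used above)::
+
+    >>> from sklearn.lda import LDA
+    >>> clf = LDA(solver='lsqr', shrinkage='auto')  # Ledoit-Wolf estimate
+    >>> clf = LDA(solver='lsqr', shrinkage=0.5)     # manually chosen amount
+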
The ``shrinkage`` parameter can also be manually set between 0 and 1. In
@@ -111,7 +158,8 @@ a high number of features.

.. topic:: References:

-    Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer, 2009.
+    .. [#1] "The Elements of Statistical Learning", Hastie T., Tibshirani R.,
+        Friedman J., Section 4.3, p.106-119, 2008.

-    Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio
+    .. [#2] Ledoit O., Wolf M. "Honey, I Shrunk the Sample Covariance Matrix", The Journal of Portfolio
        Management 30(4), 110-119, 2004.