Commit 535536e

merge with upstream/master

2 parents: 7d4e8e1 + 11d33bc

88 files changed, +16769 -8305 lines changed

.gitattributes

Lines changed: 1 addition & 0 deletions

@@ -4,6 +4,7 @@
 /sklearn/cluster/_hierarchical.cpp -diff
 /sklearn/cluster/_k_means.c -diff
 /sklearn/datasets/_svmlight_format.c -diff
+/sklearn/decomposition/_online_lda.c -diff
 /sklearn/ensemble/_gradient_boosting.c -diff
 /sklearn/feature_extraction/_hashing.c -diff
 /sklearn/linear_model/cd_fast.c -diff

COPYING

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 New BSD License

-Copyright (c) 2007–2014 The scikit-learn developers.
+Copyright (c) 2007–2015 The scikit-learn developers.
 All rights reserved.



benchmarks/bench_multilabel_metrics.py

Lines changed: 0 additions & 2 deletions

@@ -81,11 +81,9 @@ def benchmark(metrics=tuple(v for k, v in sorted(METRICS.items())),
     for i, (s, c, d) in enumerate(it):
         _, y_true = make_multilabel_classification(n_samples=s, n_features=1,
                                                    n_classes=c, n_labels=d * c,
-                                                   return_indicator=True,
                                                    random_state=42)
         _, y_pred = make_multilabel_classification(n_samples=s, n_features=1,
                                                    n_classes=c, n_labels=d * c,
-                                                   return_indicator=True,
                                                    random_state=84)
         for j, f in enumerate(formats):
             f_true = f(y_true)

benchmarks/bench_plot_approximate_neighbors.py

Lines changed: 7 additions & 11 deletions

@@ -125,20 +125,16 @@ def calc_accuracy(X, queries, n_queries, n_neighbors, exact_neighbors,

 # Set labels for LSHForest parameters
 colors = ['c', 'm', 'y']
-p1 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[0])
-p2 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[1])
-p3 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colors[2])
+legend_rects = [plt.Rectangle((0, 0), 0.1, 0.1, fc=color)
+                for color in colors]

-labels = ['n_estimators=' + str(params_list[0]['n_estimators']) +
-          ', n_candidates=' + str(params_list[0]['n_candidates']),
-          'n_estimators=' + str(params_list[1]['n_estimators']) +
-          ', n_candidates=' + str(params_list[1]['n_candidates']),
-          'n_estimators=' + str(params_list[2]['n_estimators']) +
-          ', n_candidates=' + str(params_list[2]['n_candidates'])]
+legend_labels = ['n_estimators={n_estimators}, '
+                 'n_candidates={n_candidates}'.format(**p)
+                 for p in params_list]

 # Plot precision
 plt.figure()
-plt.legend((p1, p2, p3), (labels[0], labels[1], labels[2]),
+plt.legend(legend_rects, legend_labels,
            loc='upper left')

 for i in range(len(params_list)):

@@ -154,7 +150,7 @@ def calc_accuracy(X, queries, n_queries, n_neighbors, exact_neighbors,

 # Plot speed up
 plt.figure()
-plt.legend((p1, p2, p3), (labels[0], labels[1], labels[2]),
+plt.legend(legend_rects, legend_labels,
            loc='upper left')

 for i in range(len(params_list)):
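For context only, here is what the refactored comprehension above produces; the params_list values below are hypothetical (the benchmark builds its own from its command-line arguments):

    # Illustrative sketch of the new label construction, not part of the diff.
    params_list = [{'n_estimators': 3, 'n_candidates': 10},
                   {'n_estimators': 5, 'n_candidates': 50}]
    legend_labels = ['n_estimators={n_estimators}, '
                     'n_candidates={n_candidates}'.format(**p)
                     for p in params_list]
    print(legend_labels)
    # ['n_estimators=3, n_candidates=10', 'n_estimators=5, n_candidates=50']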

doc/images/lda_model_graph.png

18.6 KB

doc/modules/classes.rst

Lines changed: 2 additions & 0 deletions

@@ -288,6 +288,7 @@ Samples generator
    decomposition.SparseCoder
    decomposition.DictionaryLearning
    decomposition.MiniBatchDictionaryLearning
+   decomposition.LatentDirichletAllocation

 .. autosummary::
    :toctree: generated/

@@ -1104,6 +1105,7 @@ See the :ref:`metrics` section of the user guide for further details.
    :template: class.rst

    preprocessing.Binarizer
+   preprocessing.FunctionTransformer
    preprocessing.Imputer
    preprocessing.KernelCenterer
    preprocessing.LabelBinarizer

doc/modules/decomposition.rst

Lines changed: 84 additions & 1 deletion

@@ -706,7 +706,7 @@ the data.
 .. topic:: Examples:

     * :ref:`example_decomposition_plot_faces_decomposition.py`
-    * :ref:`example_applications_topics_extraction_with_nmf.py`
+    * :ref:`example_applications_topics_extraction_with_nmf_lda.py`

 .. topic:: References:


@@ -726,3 +726,86 @@ the data.
       matrix factorization"
       <http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/HPCLAB020107.pdf>`_
       C. Boutsidis, E. Gallopoulos, 2008
+
+
+.. _LatentDirichletAllocation:
+
+Latent Dirichlet Allocation (LDA)
+=================================
+
+Latent Dirichlet Allocation is a generative probabilistic model for collections of
+discrete datasets such as text corpora. It is also a topic model that is used for
+discovering abstract topics from a collection of documents.
+
+The graphical model of LDA is a three-level Bayesian model:
+
+.. image:: ../images/lda_model_graph.png
+   :align: center
+
+When modeling text corpora, the model assumes the following generative process for
+a corpus with :math:`D` documents and :math:`K` topics:
+
+  1. For each topic :math:`k`, draw :math:`\beta_k \sim \mathrm{Dirichlet}(\eta), \: k = 1...K`
+
+  2. For each document :math:`d`, draw :math:`\theta_d \sim \mathrm{Dirichlet}(\alpha), \: d = 1...D`
+
+  3. For each word :math:`i` in document :math:`d`:
+
+    a. Draw a topic index :math:`z_{di} \sim \mathrm{Multinomial}(\theta_d)`
+    b. Draw the observed word :math:`w_{di} \sim \mathrm{Multinomial}(\beta_{z_{di}})`
+
+For parameter estimation, the posterior distribution is:
+
+.. math::
+  p(z, \theta, \beta | w, \alpha, \eta) =
+    \frac{p(z, \theta, \beta | \alpha, \eta)}{p(w | \alpha, \eta)}
+
+Since the posterior is intractable, the variational Bayesian method
+uses a simpler distribution :math:`q(z, \theta, \beta | \lambda, \phi, \gamma)`
+to approximate it, and those variational parameters :math:`\lambda`, :math:`\phi`,
+:math:`\gamma` are optimized to maximize the Evidence Lower Bound (ELBO):
+
+.. math::
+  \log P(w | \alpha, \eta) \geq L(w, \phi, \gamma, \lambda) \overset{\triangle}{=}
+    E_{q}[\log p(w, z, \theta, \beta | \alpha, \eta)] - E_{q}[\log q(z, \theta, \beta)]
+
+Maximizing the ELBO is equivalent to minimizing the Kullback-Leibler (KL) divergence
+between :math:`q(z, \theta, \beta)` and the true posterior
+:math:`p(z, \theta, \beta | w, \alpha, \eta)`.
+
+:class:`LatentDirichletAllocation` implements the online variational Bayes algorithm and supports
+both online and batch update methods.
+While the batch method updates the variational variables after each full pass through the data,
+the online method updates them from mini-batches of data points. Therefore,
+the online method usually converges faster than the batch method.
+
+.. note::
+
+  Although the online method is guaranteed to converge to a local optimum, the quality of
+  the optimum and the speed of convergence may depend on the mini-batch size and
+  on attributes related to the learning rate setting.
+
+When :class:`LatentDirichletAllocation` is applied to a "document-term" matrix, the matrix
+will be decomposed into a "topic-term" matrix and a "document-topic" matrix. While the
+"topic-term" matrix is stored as :attr:`components_` in the model, the "document-topic" matrix
+can be calculated with the ``transform`` method.
+
+:class:`LatentDirichletAllocation` also implements a ``partial_fit`` method, which is used
+when data can be fetched sequentially.
+
+.. topic:: Examples:
+
+    * :ref:`example_applications_topics_extraction_with_nmf_lda.py`
+
+.. topic:: References:
+
+    * `"Latent Dirichlet Allocation"
+      <https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf>`_
+      D. Blei, A. Ng, M. Jordan, 2003
+
+    * `"Online Learning for Latent Dirichlet Allocation"
+      <https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>`_
+      M. Hoffman, D. Blei, F. Bach, 2010
+
+    * `"Stochastic Variational Inference"
+      <http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf>`_
+      M. Hoffman, D. Blei, C. Wang, J. Paisley, 2013
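As a quick orientation to the class documented above, here is a minimal usage sketch (not part of the diff). The constructor arguments n_topics and learning_method are assumptions for this scikit-learn version and may be named differently in other releases:

    # Minimal sketch: decompose a "document-term" matrix with the new estimator.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["apples and oranges are sweet fruit",
            "bananas are yellow fruit",
            "linux and windows are operating systems",
            "unix is an operating system"]

    # Build the "document-term" matrix that LDA decomposes.
    tf_vectorizer = CountVectorizer(stop_words='english')
    X = tf_vectorizer.fit_transform(docs)

    # n_topics and learning_method are assumed parameter names for this release.
    lda = LatentDirichletAllocation(n_topics=2, learning_method='online',
                                    random_state=0)
    doc_topic = lda.fit_transform(X)   # "document-topic" matrix
    topic_term = lda.components_       # "topic-term" matrix

    # Print the top words of each topic.
    feature_names = tf_vectorizer.get_feature_names()
    for k, topic in enumerate(topic_term):
        top = [feature_names[i] for i in topic.argsort()[:-4:-1]]
        print("topic %d: %s" % (k, " ".join(top)))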

doc/modules/feature_extraction.rst

Lines changed: 2 additions & 2 deletions

@@ -398,7 +398,7 @@ suitable for usage by a classifier it is very common to use the tf–idf
 transform.

 Tf means **term-frequency** while tf–idf means term-frequency times
-**inverse document-frequency**. This is a originally a term weighting
+**inverse document-frequency**. This was originally a term weighting
 scheme developed for information retrieval (as a ranking function
 for search engines results), that has also found good use in document
 classification and clustering.

@@ -576,7 +576,7 @@ Finally it is possible to discover the main topics of a corpus by
 relaxing the hard assignment constraint of clustering, for instance by
 using :ref:`NMF`:

-  * :ref:`example_applications_topics_extraction_with_nmf.py`
+  * :ref:`example_applications_topics_extraction_with_nmf_lda.py`


 Limitations of the Bag of Words representation

doc/modules/lda_qda.rst

Lines changed: 73 additions & 25 deletions

@@ -8,58 +8,105 @@ Linear and quadratic discriminant analysis

 Linear discriminant analysis (:class:`lda.LDA`) and
 quadratic discriminant analysis (:class:`qda.QDA`)
-are two classic classifiers, with, as their names suggest, a linear and a
+are two standard classifiers, with, as their names suggest, a linear and a
 quadratic decision surface, respectively.

 These classifiers are attractive because they have closed-form solutions that
-can be easily computed, are inherently multiclass,
-and have proven to work well in practice.
-Also there are no parameters to tune for these algorithms.
+can be easily computed, are inherently multiclass, have proven to work well in practice and have
+no hyperparameters to tune.

 .. |ldaqda| image:: ../auto_examples/classification/images/plot_lda_qda_001.png
         :target: ../auto_examples/classification/plot_lda_qda.html
         :scale: 80

 .. centered:: |ldaqda|

-The plot shows decision boundaries for LDA and QDA. The bottom row
-demonstrates that LDA can only learn linear boundaries, while QDA can learn
+The plot shows decision boundaries for LDA and QDA. The first row shows that,
+when the class covariances are the same, LDA and QDA yield the same result
+(up to a small difference resulting from the implementation). The bottom row demonstrates that, in general,
+LDA can only learn linear boundaries, while QDA can learn
 quadratic boundaries and is therefore more flexible.

 .. topic:: Examples:

    :ref:`example_classification_plot_lda_qda.py`: Comparison of LDA and QDA on synthetic data.

-
 Dimensionality reduction using LDA
 ==================================

-:class:`lda.LDA` can be used to perform supervised dimensionality reduction by
-projecting the input data to a subspace consisting of the most
-discriminant directions.
+:class:`lda.LDA` can be used to perform supervised dimensionality reduction, by
+projecting the input data to a linear subspace consisting of the directions which maximize the
+separation between classes (in a precise sense discussed in the mathematics section below).
+The dimension of the output is necessarily less than the number of classes,
+so this is in general a rather strong dimensionality reduction, and only makes sense
+in a multiclass setting.
+
 This is implemented in :func:`lda.LDA.transform`. The desired
 dimensionality can be set using the ``n_components`` constructor
 parameter. This parameter has no influence on :func:`lda.LDA.fit` or :func:`lda.LDA.predict`.

+.. topic:: Examples:
+
+   :ref:`example_decomposition_plot_pca_vs_lda.py`: Comparison of LDA and PCA for dimensionality reduction of the Iris dataset
+
+Mathematical formulation of the LDA and QDA classifiers
+=======================================================
+
+Both LDA and QDA can be derived from simple probabilistic models
+which model the class conditional distribution of the data :math:`P(X|y=k)`
+for each class :math:`k`. Predictions can then be obtained by using Bayes' rule:
+
+.. math::
+    P(y=k | X) = \frac{P(X | y=k) P(y=k)}{P(X)} = \frac{P(X | y=k) P(y = k)}{ \sum_{l} P(X | y=l) \cdot P(y=l)}
+
+and we select the class :math:`k` which maximizes this conditional probability.
+
+More specifically, for linear and quadratic discriminant analysis, :math:`P(X|y)`
+is modelled as a multivariate Gaussian distribution with density:

-Mathematical Idea
-=================
+.. math:: p(X | y=k) = \frac{1}{(2\pi)^n |\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2} (X-\mu_k)^t \Sigma_k^{-1} (X-\mu_k)\right)

-Both methods work by modeling the class conditional distribution of the data :math:`P(X|y=k)`
-for each class :math:`k`. Predictions can be obtained by using Bayes' rule:
+To use this model as a classifier, we just need to estimate from the training data
+the class priors :math:`P(y=k)` (by the proportion of instances of class :math:`k`), the
+class means :math:`\mu_k` (by the empirical sample class means) and the covariance matrices
+(either by the empirical sample class covariance matrices, or by a regularized estimator: see the section on shrinkage below).
+
+In the case of LDA, the Gaussians for each class are assumed
+to share the same covariance matrix: :math:`\Sigma_k = \Sigma` for all :math:`k`.
+This leads to linear decision surfaces, as can be seen by comparing the log-probability ratios
+:math:`\log[P(y=k | X) / P(y=l | X)]`:

 .. math::
-    P(y | X) = P(X | y) \cdot P(y) / P(X) = P(X | y) \cdot P(Y) / ( \sum_{y'} P(X | y') \cdot p(y'))
+    \log\left(\frac{P(y=k|X)}{P(y=l | X)}\right) = 0 \Leftrightarrow (\mu_k-\mu_l)\Sigma^{-1} X = \frac{1}{2} (\mu_k^t \Sigma^{-1} \mu_k - \mu_l^t \Sigma^{-1} \mu_l)
+
+In the case of QDA, there are no assumptions on the covariance matrices :math:`\Sigma_k` of the Gaussians,
+leading to quadratic decision surfaces. See [#1]_ for more details.
+
+.. note:: **Relation with Gaussian Naive Bayes**
+
+    If in the QDA model one assumes that the covariance matrices are diagonal, then
+    this means that we assume the classes are conditionally independent,
+    and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier :class:`GaussianNB`.
+
+Mathematical formulation of LDA dimensionality reduction
+=========================================================
+
+To understand the use of LDA in dimensionality reduction, it is useful to start
+with a geometric reformulation of the LDA classification rule explained above.
+We write :math:`K` for the total number of target classes.
+Since in LDA we assume that all classes have the same estimated covariance :math:`\Sigma`, we can rescale the
+data so that this covariance is the identity:

-In linear and quadratic discriminant analysis, :math:`P(X|y)`
-is modelled as a Gaussian distribution.
-In the case of LDA, the Gaussians for each class are assumed to share the same covariance matrix.
-This leads to a linear decision surface, as can be seen by comparing the the log-probability rations
-:math:`log[P(y=k | X) / P(y=l | X)]`.
+.. math:: X^* = D^{-1/2}U^t X\text{ with }\Sigma = UDU^t

-In the case of QDA, there are no assumptions on the covariance matrices of the Gaussians,
-leading to a quadratic decision surface.
+Then one can show that classifying a data point after scaling is equivalent to finding the estimated class mean :math:`\mu^*_k` which is
+closest to the data point in the Euclidean distance. But this can be done just as well after projecting onto the :math:`K-1` affine subspace :math:`H_K`
+generated by all the :math:`\mu^*_k` for all classes. This shows that, implicit in the LDA classifier, there is
+a dimensionality reduction by linear projection onto a :math:`K-1` dimensional space.

+We can reduce the dimension even more, to a chosen :math:`L`, by projecting onto the linear subspace :math:`H_L` which
+maximizes the variance of the :math:`\mu^*_k` after projection (in effect, we are doing a form of PCA on the transformed class means :math:`\mu^*_k`).
+This :math:`L` corresponds to the ``n_components`` parameter used in the :func:`lda.LDA.transform` method. See [#1]_ for more details.

 Shrinkage
 =========

@@ -70,7 +117,7 @@ features. In this scenario, the empirical sample covariance is a poor
 estimator. Shrinkage LDA can be used by setting the ``shrinkage`` parameter of
 the :class:`lda.LDA` class to 'auto'. This automatically determines the
 optimal shrinkage parameter in an analytic way following the lemma introduced
-by Ledoit and Wolf. Note that currently shrinkage only works when setting the
+by Ledoit and Wolf [#2]_. Note that currently shrinkage only works when setting the
 ``solver`` parameter to 'lsqr' or 'eigen'.

 The ``shrinkage`` parameter can also be manually set between 0 and 1. In

@@ -111,7 +158,8 @@ a high number of features.

 .. topic:: References:

-    Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer, 2009.
+    .. [#1] "The Elements of Statistical Learning", Hastie T., Tibshirani R.,
+        Friedman J., Section 4.3, p.106-119, 2008.

-    Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio
+    .. [#2] Ledoit O, Wolf M. Honey, I Shrunk the Sample Covariance Matrix. The Journal of Portfolio
        Management 30(4), 110-119, 2004.
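To tie the APIs mentioned in this file together, here is a minimal sketch; the toy data and parameter values are illustrative only, and the import path simply follows the :class:`lda.LDA` reference used in this version of the docs:

    # Minimal sketch of LDA classification, dimensionality reduction and shrinkage.
    import numpy as np
    from sklearn.lda import LDA

    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    y = np.array([1, 1, 1, 2, 2, 2])

    # Plain LDA classification: closed-form fit, no hyperparameters required.
    clf = LDA().fit(X, y)
    print(clf.predict([[-0.8, -1]]))

    # Supervised dimensionality reduction to at most n_classes - 1 dimensions.
    X_reduced = LDA(n_components=1).fit(X, y).transform(X)

    # Ledoit-Wolf shrinkage, available with the 'lsqr' or 'eigen' solvers.
    clf_shrunk = LDA(solver='lsqr', shrinkage='auto').fit(X, y)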

doc/modules/model_evaluation.rst

Lines changed: 17 additions & 0 deletions

@@ -353,6 +353,23 @@ In the multilabel case with binary label indicators: ::
   for an example of accuracy score usage using permutations of
   the dataset.

+.. _cohen_kappa:
+
+Cohen's kappa
+-------------
+
+The function :func:`cohen_kappa_score` computes Cohen's kappa statistic.
+This measure is intended to compare labelings by different human annotators,
+not a classifier versus a ground truth.
+
+The kappa score (see docstring) is a number between -1 and 1.
+Scores above .8 are generally considered good agreement;
+zero or lower means no agreement (practically random labels).
+
+Kappa scores can be computed for binary or multiclass problems,
+but not for multilabel problems (except by manually computing a per-label score)
+and not for more than two annotators.
+
 .. _confusion_matrix:

 Confusion matrix
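A short sketch of the function documented above, comparing two hypothetical annotators' labelings of the same items:

    # Minimal Cohen's kappa example; the label arrays are illustrative only.
    from sklearn.metrics import cohen_kappa_score

    annotator1 = [0, 1, 2, 2, 1, 0, 1, 2]
    annotator2 = [0, 1, 2, 1, 1, 0, 2, 2]
    kappa = cohen_kappa_score(annotator1, annotator2)
    print(kappa)  # a value in [-1, 1]; above ~0.8 is usually read as good agreement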
