
Commit 8b1b194

more work on the classifier section
1 parent 40beda3 commit 8b1b194


tutorial/general_concepts.rst

Lines changed: 69 additions & 1 deletion
@@ -264,7 +264,75 @@ petals and sepals. This is a classification task, hence we have::
 
     >>> X, y = iris.data, iris.target
 
-TODO: sample SVM classifier
+Once the data has this format it is trivial to train a classifier,
+for instance a support vector machine with a linear kernel (or lack
+thereof)::
+
+    >>> from scikits.learn.svm import LinearSVC
+    >>> clf = LinearSVC()
+
+``clf`` is a statistical model with parameters that control the
+learning algorithm (such parameters are sometimes called the
+hyperparameters). The hyperparameters can be supplied by the user in
+the constructor of the model. We will explain later how to choose a
+good combination, either using simple empirical rules or data-driven
+selection::
+
+    >>> clf
+    LinearSVC(loss='l2', C=1.0, intercept_scaling=1, fit_intercept=True,
+         eps=0.0001, penalty='l2', multi_class=False, dual=True)
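+
+For instance, one could pass a different value of the regularization
+parameter ``C`` seen in the repr above (an illustrative value, not a
+recommendation)::
+
+    >>> clf_regularized = LinearSVC(C=0.1)  # smaller C means stronger regularization
+    >>> clf_regularized.C
+    0.1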
+
+By default the actual model parameters are not initialized; they are
+tuned automatically from the data by calling the ``fit`` method::
+
+    >>> clf = clf.fit(X, y)
+
+    >>> clf.coef_
+    array([[ 0.18423474,  0.45122764, -0.80794654, -0.45071379],
+           [ 0.04864394, -0.88914385,  0.40540293, -0.93720122],
+           [-0.85086062, -0.98671553,  1.38098573,  1.8653574 ]])
+
+    >>> clf.intercept_
+    array([ 0.10956015,  1.6738296 , -1.70973044])
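+
+As a rough sketch of what was learned (assuming the usual
+one-vs-rest linear decision rule, not necessarily every
+implementation detail of ``LinearSVC``), the predicted class is the
+one whose hyperplane gives the largest score::
+
+    >>> import numpy as np
+    >>> scores = np.dot(X, clf.coef_.T) + clf.intercept_  # one column of scores per class
+    >>> scores[0].argmax()  # class with the largest score for the first sample
+    0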
+
+Once the model is trained, it can be used to predict the most likely
+outcome for unseen data. For instance, let us define a simple sample
+that looks like the first sample of the iris dataset::
+
+    >>> X_new = [[ 5.0, 3.6, 1.3, 0.25]]
+
+    >>> clf.predict(X_new)
+    array([0], dtype=int32)
+
+The outcome is ``0``, which is the id of the first iris class, namely
+'setosa'.
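+
+The id can be mapped back to a string label, assuming the iris bunch
+exposes a ``target_names`` attribute as the dataset loaders do::
+
+    >>> iris.target_names[clf.predict(X_new)[0]]
+    'setosa'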
+
+Some ``scikit-learn`` classifiers can further predict probabilities
+of the outcome. This is the case for logistic regression models::
+
+    >>> from scikits.learn.linear_model import LogisticRegression
+    >>> clf2 = LogisticRegression().fit(X, y)
+    >>> clf2
+    LogisticRegression(C=1.0, intercept_scaling=1, fit_intercept=True, eps=0.0001,
+              penalty='l2', dual=False)
+
+    >>> clf2.predict_proba(X_new)
+    array([[  9.07512928e-01,   9.24770379e-02,   1.00343962e-05]])
+
+This means that the model estimates that the sample in ``X_new`` has:
+
+- a 90% likelihood of belonging to the 'setosa' class
+
+- a 9% likelihood of belonging to the 'versicolor' class
+
+- a 1% likelihood of belonging to the 'virginica' class
+
+Of course the ``predict`` method, which outputs the label id of the
+most likely outcome, is also available::
+
+    >>> clf2.predict(X_new)
+    array([0], dtype=int32)
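+
+As a quick sanity check (a sketch of how the two methods relate), the
+label returned by ``predict`` is the column with the highest
+estimated probability::
+
+    >>> clf2.predict_proba(X_new).argmax(axis=1)
+    array([0])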
+
 
 TODO: table of scikit-learn classifier models
 