Commit c096796

removing confusing section + slight reorg
1 parent 21a06cf commit c096796

2 files changed, with 12 additions and 33 deletions.

tutorial/general_concepts.rst

Lines changed: 10 additions & 25 deletions
@@ -208,25 +208,6 @@ Practical implementations of such feature extraction strategies
 will be presented in the last sections of this tutorial.


-How to devise a "good" feature extraction strategy
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The feature extraction strategy depends on both the task we are
-trying to perform and the nature of the collected data. Therefore
-there is no formal rule to define which strategy is the best.
-
-A good rule of thumb is to imagine a human being performing the
-task the machine is trying to accomplish using only the numerical
-features provided to the machine.
-
-Usually the feature extraction is useful if and only if two samples
-**judged similar in real life** by the human being are **close
-according to some similarity metric of the feature space**.
-
-In other words, the feature extraction strategy must somehow preserve
-the intuitive topology of the sample set.
-
-
 Supervised Learning: ``model.fit(X, y)``
 ----------------------------------------

@@ -279,6 +260,13 @@ of thereof)::
     >>> from scikits.learn.svm import LinearSVC
     >>> clf = LinearSVC()

+.. note::
+
+    Whenever you import a scikit-learn class or function for the first time,
+    you are advised to read the docstring by using the ``?`` magic suffix
+    of ipython, for instance type: ``LinearSVC?``.
+
+
 ``clf`` is a statistical model that has parameters that control the
 learning algorithm (those parameters are sometimes called the
 hyper-parameters). Those hyperparameters can be supplied by the
@@ -752,7 +740,7 @@ using for fitting the model:


 The overfitting issue
-+++++++++++++++++++++
+~~~~~~~~~~~~~~~~~~~~~

 The problem lies in the fact that some models can be subject to the
 **overfitting** issue: they can **learn the training data by heart**
@@ -769,7 +757,7 @@ whether your model is overfitting or not.


 Solutions to overfitting
-++++++++++++++++++++++++
+~~~~~~~~~~~~~~~~~~~~~~~~

 The solution to this issue is twofold:

@@ -786,7 +774,7 @@ The solution to this issue is twofold:


 Measuring classification performance on a test set
-++++++++++++++++++++++++++++++++++++++++++++++++++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Here is an example of how to split the data on the iris dataset.

@@ -846,9 +834,6 @@ Key takeaway points

 - Build ``X`` (feature vectors) with shape ``(n_samples, n_features)``

-- Metrics in feature space should try to preserve the intuitive pairwise
-  "closeness" of samples
-
 - Supervised learning: ``clf.fit(X, y)`` and then ``clf.predict(X_new)``

 - Classification: ``y`` is an array of integers
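
The ``clf.fit(X, y)`` then ``clf.predict(X_new)`` pattern named in the takeaway points above, together with the train/test split that the tutorial discusses, can be sketched roughly as follows. This sketch is not part of the commit. It assumes a recent scikit-learn release where the package is imported as ``sklearn`` (the diff still shows the older ``scikits.learn`` name) and where ``train_test_split`` lives in ``sklearn.model_selection``. The ``test_size``, ``random_state`` and ``max_iter`` values are arbitrary choices made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# X has shape (n_samples, n_features); y is an array of integer class labels.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a quarter of the samples so the model is scored on data
# it has not seen during fitting (a guard against overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are passed to the constructor; max_iter is raised
# here only to give the solver room to converge.
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# Predict labels for the held-out samples and measure the accuracy.
y_pred = clf.predict(X_test)
accuracy = (y_pred == y_test).mean()
print("test accuracy: %.2f" % accuracy)
```

Scoring on ``X_test`` rather than ``X_train`` is the point of the split: a model that has memorized the training data can still score poorly on held-out samples.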

tutorial/working_with_text_data.rst

Lines changed: 2 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -38,15 +38,9 @@ We can now load the list of files matching those categories as follows::
3838
>>> training_set = load_files('data/twenty_newsgroups/20news-bydate-train',
3939
... categories=categories)
4040

41-
.. note::
4241

43-
Whenever you import a scikit-learn class or function of the first time,
44-
you are advised to read the docstring by using the ``?`` magic suffix
45-
of ipython, for instance type: ``load_files?``.
46-
47-
48-
The returned dataset is a ``scikit-learn`` "bunch": a simple class
49-
holder with fields that can be both accessed as python ``dict``
42+
The returned dataset is a ``scikit-learn`` "bunch": a simple holder
43+
object with fields that can be both accessed as python ``dict``
5044
keys or ``object`` attributes for convenience, for instance the
5145
``target_names`` holds the list of the requested category names::
5246
