Commit 39673b6

some work on the introductionary section
1 parent 2a3f367 commit 39673b6

4 files changed

+217
-6
lines changed

tutorial/finding_help.rst

Lines changed: 28 additions & 3 deletions
@@ -1,9 +1,34 @@
11
Finding help
22
============
33

4-
- the scikit-learn mailing list
54

6-
- http://metaoptimize.com/qa
5+
The project mailing list
6+
------------------------
77

8-
- http://quora.com/Machine-Learning
8+
If you encounter a bug with ``scikit-learn`` or something that needs
9+
clarification in the docstring or the online documentation, please feel free to
10+
ask on the `Mailing List <http://scikit-learn.sourceforge.net/support.html>`_.
11+
12+
13+
Q&A communities with Machine Learning practitioners
14+
----------------------------------------------------
15+
16+
:Metaoptimize/QA:
17+
18+
A forum for Machine Learning, Natural Language Processing and
19+
other Data Analytics discussions (similar to what Stack Overflow
20+
is for developers):
21+
22+
http://metaoptimize.com/qa
23+
24+
:Quora.com:
25+
26+
Quora has a topic for Machine Learning related questions that also features
27+
some interesting discussions:
28+
29+
http://quora.com/Machine-Learning
30+
31+
Have a look at the best questions section, e.g.:
32+
33+
http://www.quora.com/What-are-some-good-resources-for-learning-about-machine-learning
934

tutorial/general_concepts.rst

Lines changed: 187 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,191 @@
11
Machine Learning 101
22
====================
33

4+
Machine Learning is about building **programs with tunable parameters**
5+
(typically an array of floating point values) that are adjusted
6+
automatically so as to improve their behavior by **adapting to
7+
previously seen data**.
48

5-
Feature extraction
6-
-------------------
9+
Machine Learning can be considered a **subfield of Artificial
10+
Intelligence** since those algorithms can be seen as building blocks
11+
to make computers learn to behave more intelligently by somehow
12+
**generalizing** rather than just storing and retrieving data items
13+
like a database system would do.
714

8-
- Principles
15+
The following will introduce the main concepts used to qualify
16+
machine learning algorithms, show how those concepts match with the
17+
``scikit-learn`` API and give example applications.
18+
19+
20+
Features and feature extraction
21+
-------------------------------
22+
23+
Most machine learning algorithms implemented in ``scikit-learn``
24+
expect a numpy array as input ``X``. The expected shape of X is
25+
``(n_samples, n_features)``.
26+
27+
:``n_samples``:
28+
29+
The number of samples: each sample is an item to process (e.g.
30+
classify). A sample can be a document, a picture, a sound, a
31+
video, a row in a database or CSV file, or whatever you can
32+
describe with a fixed set of quantitative traits.
33+
34+
:``n_features``:
35+
36+
The number of features or distinct traits that can be used to
37+
describe each item in a quantitative manner.
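As a minimal plain-Python sketch of the ``(n_samples, n_features)`` layout described above (the values are toy data chosen for illustration, and a nested list stands in for the numpy array that ``scikit-learn`` would actually expect):

```python
# Toy data (illustrative values): 3 samples, each described by 4 features.
# A nested list is used here only to show the layout; scikit-learn itself
# expects a numpy array with the same shape.
X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]

n_samples = len(X)        # one row per item to process
n_features = len(X[0])    # one column per quantitative trait

# Every sample must be described by the same number of features.
is_rectangular = all(len(row) == n_features for row in X)

shape = (n_samples, n_features)
print(shape, is_rectangular)
```

Each row is one sample and each column is one feature, so adding a new flower adds a row while the number of columns stays the same.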
38+
39+
40+
The number of features must be fixed in advance. However, it can be
41+
very high dimensional (e.g. millions of features) with most of them
42+
being zeros for a given sample. In this case we use ``scipy.sparse``
43+
matrices instead of ``numpy`` arrays so as to make the data fit
44+
in memory.
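To illustrate why a sparse representation saves memory, here is a rough plain-Python sketch that stores only the non-zero entries of a row. The helper names ``to_sparse`` and ``to_dense`` are hypothetical and are not part of ``scipy.sparse``, which provides much more efficient implementations of the same idea:

```python
def to_sparse(row):
    """Keep only the non-zero entries of a dense row as {column: value}."""
    return {j: value for j, value in enumerate(row) if value != 0}


def to_dense(sparse_row, n_features):
    """Rebuild the dense row; every column that is not stored is zero."""
    row = [0.0] * n_features
    for j, value in sparse_row.items():
        row[j] = value
    return row


n_features = 100_000            # illustrative size
dense_row = [0.0] * n_features
dense_row[3] = 2.0              # only two features are non-zero
dense_row[99_999] = 1.0

sparse_row = to_sparse(dense_row)
print(len(dense_row), len(sparse_row))
```

The dense row holds one value per feature, whereas the sparse row holds one entry per non-zero feature, which is what makes very high dimensional data fit in memory.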
45+
46+
47+
A simple example: the iris dataset
48+
----------------------------------
49+
50+
The machine learning community often uses a simple flowers database
51+
where each row in the database (or CSV file) is a set of measurements
52+
of an individual iris flower.
53+
54+
.. figure:: images/Virginia_Iris.png
55+
:scale: 100 %
56+
:align: center
57+
:alt: Photo of Iris Virginica
58+
59+
Iris Virginica (source: Wikipedia)
60+
61+
62+
Each sample in this dataset is described by 4 features and can
63+
belong to one of three target classes:
64+
65+
:Features in the Iris dataset:
66+
67+
0. sepal length in cm
68+
1. sepal width in cm
69+
2. petal length in cm
70+
3. petal width in cm
71+
72+
:Target classes to predict:
73+
74+
0. Iris Setosa
75+
1. Iris Versicolour
76+
2. Iris Virginica
77+
78+
79+
``scikit-learn`` embeds a copy of the iris CSV file along with a
80+
helper function to load it into numpy arrays::
81+
82+
>>> from scikits.learn.datasets import load_iris
83+
>>> iris = load_iris()
84+
85+
The features of each sample flower are stored in the ``data`` attribute
86+
of the dataset::
87+
88+
>>> n_samples, n_features = iris.data.shape
89+
>>> n_samples
90+
150
91+
92+
>>> n_features
93+
4
94+
95+
>>> iris.data[0]
96+
array([ 5.1, 3.5, 1.4, 0.2])
97+
98+
99+
The information about the class of each sample is stored in the
100+
``target`` attribute of the dataset::
101+
102+
>>> len(iris.target) == n_samples
103+
True
104+
105+
>>> iris.target
106+
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
107+
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
108+
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
109+
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
110+
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
111+
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
112+
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
113+
114+
The names of the classes are stored in the last attribute, namely
115+
``target_names``::
116+
117+
>>> list(iris.target_names)
118+
['setosa', 'versicolor', 'virginica']
119+
120+
121+
Handling categorical features
122+
-----------------------------
123+
124+
TODO
125+
126+
Extracting features from unstructured data
127+
------------------------------------------
128+
129+
The previous example deals with features that are readily available
130+
in a structured dataset with rows and columns of numerical or
131+
categorical values.
132+
133+
However, **most of the produced data is not readily available in a
134+
structured representation** such as SQL, CSV, XML, JSON or RDF.
135+
136+
Here is an overview of strategies to turn unstructured data items
137+
into arrays of numerical features.
138+
139+
140+
:Text documents:
141+
142+
Count the frequency of each word or pair of consecutive words
143+
in each document. This approach is called the **Bag of Words**.
144+
145+
Note: we include other file formats such as HTML and PDF in
146+
this category: an ad-hoc preprocessing step is required to
147+
extract the plain text in UTF-8 encoding for instance.
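The Bag of Words idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the ``tokenize`` and ``bag_of_words`` helpers are hypothetical names, the tokenizer is a naive lowercase split on ASCII letters and digits, and only single words are counted (not pairs of consecutive words):

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase, then keep runs of ASCII letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, X) where X[i][j] counts word j in document i."""
    token_lists = [tokenize(doc) for doc in documents]
    # The vocabulary fixes the number of features in advance.
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}

    X = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        X.append(row)
    return vocabulary, X


docs = ["the cat sat", "the cat saw the dog"]
vocabulary, X = bag_of_words(docs)
print(vocabulary)
for row in X:
    print(row)
```

Each document becomes one row of word counts with one column per vocabulary word, so the result has the ``(n_samples, n_features)`` shape used throughout this section.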
148+
149+
150+
:Images:
151+
152+
- Rescale the picture to a fixed size and **take all the raw
153+
pixel values** (with or without luminosity normalization)
154+
155+
- Take some transformation of the signal (gradients in each
156+
pixel, wavelet transforms...)
157+
158+
- Compute the Euclidean, Manhattan or cosine **similarities of
159+
the sample to a set of reference prototype images** arranged in a
160+
code book. The code book may have been previously extracted
161+
on the same dataset using an unsupervised learning algorithm
162+
on the raw pixel signal.
163+
164+
Each feature value is the distance to one element of the code
165+
book.
166+
167+
- Perform **local feature extraction**: split the picture into
168+
small regions and perform feature extraction locally in each
169+
area.
170+
171+
Then combine all the features of the individual areas into a
172+
single array.
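The code book strategy listed above can be sketched as follows. This is a simplified illustration rather than a ``scikit-learn`` API: the function names are hypothetical, the "images" are tiny flattened 4-pixel vectors, and the code book is written by hand instead of being learned by an unsupervised algorithm:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Convention chosen for this sketch: similarity to a zero vector is 0.
    return dot / norm if norm else 0.0


def code_book_features(sample, code_book, measure=euclidean):
    """One feature per prototype: the measure between sample and prototype."""
    return [measure(sample, prototype) for prototype in code_book]


# Hand-written code book of two prototypes (illustrative values).
code_book = [
    [0.0, 0.0, 0.0, 0.0],   # all-dark prototype
    [1.0, 1.0, 1.0, 1.0],   # all-bright prototype
]
sample = [1.0, 0.0, 1.0, 0.0]

features = code_book_features(sample, code_book)
print(features)
print(code_book_features(sample, code_book, measure=manhattan))
```

Whatever the size of the original picture, the resulting feature array has as many entries as there are prototypes in the code book.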
173+
174+
:Sounds:
175+
176+
Same strategies as for images, but in a 1D space instead of 2D.
177+
178+
179+
Practical implementations of such feature extraction strategies
180+
will be presented in the last sections of this tutorial.
181+
182+
183+
How to evaluate the quality of a feature extraction strategy
184+
------------------------------------------------------------
185+
186+
The rule of thumb is two samples that seem close or related to
187+
188+
And conversely, samples that seem close in
9189

10190

11191
Supervised Learning: ``model.fit(X, y)``
@@ -28,3 +208,7 @@ Unsupervised Learning: ``model.fit(X)``
28208
- Real life applications
29209

30210

211+
212+
Training set, test sets and overfitting
213+
---------------------------------------
214+

tutorial/images/Virginia_Iris.png

65.6 KB

tutorial/working_with_text_data.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,3 +92,5 @@ before re-training on the complete dataset later.
9292
Extracting features from text files
9393
-----------------------------------
9494

95+
96+
