Machine Learning 101
====================

Machine Learning is about building **programs with tunable parameters**
(typically an array of floating point values) that are adjusted
automatically so as to improve their behavior by **adapting to
previously seen data**.

Machine Learning can be considered a **subfield of Artificial
Intelligence** since those algorithms can be seen as building blocks
to make computers learn to behave more intelligently by somehow
**generalizing** rather than just storing and retrieving data items
like a database system would do.

The following will introduce the main concepts used to qualify
machine learning algorithms, show how those concepts map onto the
``scikit-learn`` API, and give example applications.


Features and feature extraction
-------------------------------

Most machine learning algorithms implemented in ``scikit-learn``
expect a numpy array as input ``X``. The expected shape of ``X`` is
``(n_samples, n_features)``.

:``n_samples``:

    The number of samples: each sample is an item to process (e.g.
    classify). A sample can be a document, a picture, a sound, a
    video, a row in a database or CSV file, or whatever you can
    describe with a fixed set of quantitative traits.

:``n_features``:

    The number of features or distinct traits that can be used to
    describe each item in a quantitative manner.


The number of features must be fixed in advance. However it can be
very high dimensional (e.g. millions of features) with most of them
being zeros for a given sample. In this case we use ``scipy.sparse``
matrices instead of ``numpy`` arrays so as to make the data fit
in memory.
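
For instance, here is a minimal sketch of this shape convention (the
tiny 3x4 array below is made up for the sake of the example)::

    >>> import numpy as np
    >>> from scipy import sparse

    >>> X = np.array([[0., 1., 0., 3.],
    ...               [2., 0., 0., 0.],
    ...               [0., 0., 0., 5.]])
    >>> X.shape  # (n_samples, n_features)
    (3, 4)

    >>> # Same data as a sparse matrix: only the non-zero values are stored
    >>> X_sparse = sparse.csr_matrix(X)
    >>> X_sparse.nnz
    4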


A simple example: the iris dataset
----------------------------------

The machine learning community often uses a simple flower database
where each row in the database (or CSV file) is a set of measurements
of an individual iris flower.

.. figure:: images/Virginia_Iris.png
   :scale: 100 %
   :align: center
   :alt: Photo of Iris Virginica

   Iris Virginica (source: Wikipedia)


Each sample in this dataset is described by 4 features and can
belong to one of the three target classes:

:Features in the Iris dataset:

    0. sepal length in cm
    1. sepal width in cm
    2. petal length in cm
    3. petal width in cm

:Target classes to predict:

    0. Iris Setosa
    1. Iris Versicolour
    2. Iris Virginica


``scikit-learn`` embeds a copy of the iris CSV file along with a
helper function to load it into numpy arrays::

    >>> from sklearn.datasets import load_iris
    >>> iris = load_iris()

The features of each sample flower are stored in the ``data``
attribute of the dataset::

    >>> n_samples, n_features = iris.data.shape
    >>> n_samples
    150

    >>> n_features
    4

    >>> iris.data[0]
    array([5.1, 3.5, 1.4, 0.2])


The information about the class of each sample is stored in the
``target`` attribute of the dataset::

    >>> len(iris.target) == n_samples
    True

    >>> iris.target
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The names of the classes are stored in the ``target_names``
attribute::

    >>> list(iris.target_names)
    ['setosa', 'versicolor', 'virginica']
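
The integer class labels can thus be mapped back to the matching
class names; for instance the first sample is a setosa::

    >>> print(iris.target_names[iris.target[0]])
    setosa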


Handling categorical features
-----------------------------

Some features are categorical rather than continuous: for instance
the color of a flower can be ``"red"``, ``"blue"`` or ``"white"``.
Such values cannot be fed directly to most algorithms as arbitrary
integers, since that would impose an ordering and a notion of
distance between categories that do not exist. A common strategy is
instead to expand the categorical feature into one boolean feature
per possible value: this is often called one-hot encoding.
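
Here is a minimal pure-Python sketch of this encoding (the color
values are made up for the example)::

    >>> colors = ['red', 'blue', 'white', 'blue']
    >>> categories = sorted(set(colors))
    >>> categories
    ['blue', 'red', 'white']

    >>> # One boolean feature per category: 1.0 where the category matches
    >>> X = [[1.0 if color == c else 0.0 for c in categories]
    ...      for color in colors]
    >>> X
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]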


Extracting features from unstructured data
------------------------------------------

The previous example deals with features that are readily available
in structured datasets with rows and columns of numerical or
categorical values.

However, **most of the produced data is not readily available in a
structured representation** such as SQL, CSV, XML, JSON or RDF.

Here is an overview of strategies to turn unstructured data items
into arrays of numerical features.


:Text documents:

    Count the frequency of each word or pair of consecutive words
    in each document. This approach is called the **Bag of Words**
    (a minimal sketch is given after this list).

    Note: we include other file formats such as HTML and PDF in
    this category: an ad-hoc preprocessing step is required to
    extract the plain text (in UTF-8 encoding, for instance).


:Images:

    - Rescale the picture to a fixed size and **take all the raw
      pixel values** (with or without luminosity normalization).

    - Take some transformation of the signal (gradients in each
      pixel, wavelet transforms...).

    - Compute the Euclidean, Manhattan or cosine **similarities of
      the sample to a set of reference prototype images** arranged
      in a code book. The code book may have been previously
      extracted from the same dataset using an unsupervised learning
      algorithm on the raw pixel signal.

      Each feature value is the distance to one element of the code
      book.

    - Perform **local feature extraction**: split the picture into
      small regions and perform feature extraction locally in each
      area.

      Then combine all the features of the individual areas into a
      single array.

:Sounds:

    Same strategy as for images, but in a 1D space instead of 2D.

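Here is a minimal pure-Python sketch of the bag of words idea (the
two-document corpus is made up; a real implementation would use
sparse data structures and a much larger vocabulary)::

    >>> docs = ["the cat sat", "the cat ate the fish"]
    >>> vocabulary = sorted(set(" ".join(docs).split()))
    >>> vocabulary
    ['ate', 'cat', 'fish', 'sat', 'the']

    >>> # One row of word counts per document: shape (n_samples, n_features)
    >>> X = [[doc.split().count(word) for word in vocabulary]
    ...      for doc in docs]
    >>> X
    [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
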
Practical implementations of such feature extraction strategies
will be presented in the last sections of this tutorial.


How to evaluate the quality of a feature extraction strategy
------------------------------------------------------------

The rule of thumb is that two samples that seem close or related to
a human observer should also get feature vectors that are close to
one another. And conversely, samples that seem close in feature
space should correspond to items that are actually similar or
related.
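
As a quick check of the first half of this rule on the iris data
loaded earlier (values rounded for readability), the distance
between two setosa flowers is much smaller than the distance between
a setosa and a virginica::

    >>> import numpy as np

    >>> # Two samples of the same species (setosa)
    >>> round(float(np.linalg.norm(iris.data[0] - iris.data[1])), 2)
    0.54

    >>> # A setosa sample and a virginica sample
    >>> round(float(np.linalg.norm(iris.data[0] - iris.data[100])), 2)
    5.28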


Supervised Learning: ``model.fit(X, y)``
----------------------------------------

Unsupervised Learning: ``model.fit(X)``
---------------------------------------

- Real life applications


Training sets, test sets and overfitting
----------------------------------------