Commit 8ee70f4

committed
add validation exercise
1 parent 1bb0bc7 commit 8ee70f4

3 files changed

+240
-28
lines changed

notebooks/06B_validation_exercise.ipynb

Lines changed: 195 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -25,13 +25,31 @@
2525
"metadata": {},
2626
"outputs": []
2727
},
28+
{
29+
"cell_type": "markdown",
30+
"metadata": {},
31+
"source": [
32+
"This exercise covers cross-validation of regression models on the Diabetes\n",
33+
"dataset. The diabetes data consists of 10 physiological variables (such as age, sex, weight, and blood pressure)\n",
34+
"measured on 442 patients, and an indication of disease progression after one year:"
35+
]
36+
},
2837
{
2938
"cell_type": "code",
3039
"collapsed": false,
3140
"input": [
3241
"from sklearn.datasets import load_diabetes\n",
3342
"data = load_diabetes()\n",
34-
"X, y = data.data, data.target\n",
43+
"X, y = data.data, data.target"
44+
],
45+
"language": "python",
46+
"metadata": {},
47+
"outputs": []
48+
},
49+
{
50+
"cell_type": "code",
51+
"collapsed": false,
52+
"input": [
3553
"print X.shape"
3654
],
3755
"language": "python",
@@ -42,97 +60,246 @@
4260
"cell_type": "code",
4361
"collapsed": false,
4462
"input": [
45-
"from sklearn.linear_model import Ridge, Lasso\n",
63+
"print y.shape"
64+
],
65+
"language": "python",
66+
"metadata": {},
67+
"outputs": []
68+
},
69+
{
70+
"cell_type": "markdown",
71+
"metadata": {},
72+
"source": [
73+
"Here we'll be fitting two regularized linear models,\n",
74+
"*Ridge Regression*, which uses $\\ell_2$ regularization,\n",
75+
"and *Lasso Regression*, which uses $\\ell_1$ regularization."
76+
]
77+
},
78+
{
79+
"cell_type": "code",
80+
"collapsed": false,
81+
"input": [
82+
"from sklearn.linear_model import Ridge, Lasso"
83+
],
84+
"language": "python",
85+
"metadata": {},
86+
"outputs": []
87+
},
88+
{
89+
"cell_type": "markdown",
90+
"metadata": {},
91+
"source": [
92+
"We'll first fit each estimator with its default hyperparameters to get a baseline. We'll\n",
93+
"use the cross-validation score to determine goodness-of-fit."
94+
]
95+
},
96+
{
97+
"cell_type": "code",
98+
"collapsed": false,
99+
"input": [
46100
"from sklearn.cross_validation import cross_val_score\n",
47101
"\n",
48-
"alphas = np.logspace(-4, 0, 20)\n",
49-
"\n",
50-
"# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
51-
"# as a function of alpha. Which is more difficult to tune?\n",
52-
"\n"
102+
"for Model in [Ridge, Lasso]:\n",
103+
" model = Model()\n",
104+
" print Model.__name__, cross_val_score(model, X, y).mean()"
53105
],
54106
"language": "python",
55107
"metadata": {},
56108
"outputs": []
57109
},
110+
{
111+
"cell_type": "markdown",
112+
"metadata": {},
113+
"source": [
114+
"We see that for the default hyperparameter values, Lasso outperforms Ridge.\n",
115+
"But is this the case for the *optimal* hyperparameters of each model?"
116+
]
117+
},
118+
{
119+
"cell_type": "heading",
120+
"level": 2,
121+
"metadata": {},
122+
"source": [
123+
"Exercise: Basic Hyperparameter Optimization"
124+
]
125+
},
126+
{
127+
"cell_type": "markdown",
128+
"metadata": {},
129+
"source": [
130+
"Spend some time writing a function which computes the cross-validation\n",
131+
"score as a function of ``alpha``, the strength of the regularization for\n",
132+
"``Lasso`` and ``Ridge``. We'll choose 30 values of ``alpha`` between\n",
133+
"0.001 and 0.1:"
134+
]
135+
},
58136
{
59137
"cell_type": "code",
60138
"collapsed": false,
61139
"input": [
62-
"clf = Lasso(alpha=1.0).fit(X, y)\n",
63-
"clf.sparse_coef_.toarray()"
140+
"alphas = np.logspace(-3, -1, 30)\n",
141+
"\n",
142+
"# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
143+
"# as a function of alpha. Which is more difficult to tune?"
64144
],
65145
"language": "python",
66146
"metadata": {},
67147
"outputs": []
68148
},
149+
{
150+
"cell_type": "heading",
151+
"level": 3,
152+
"metadata": {},
153+
"source": [
154+
"Solution"
155+
]
156+
},
69157
{
70158
"cell_type": "code",
71159
"collapsed": false,
72160
"input": [
73-
"from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV\n",
74-
"from sklearn.cross_validation import cross_val_score"
161+
"%load solutions/06B_basic_grid_search.py"
75162
],
76163
"language": "python",
77164
"metadata": {},
78165
"outputs": []
79166
},
167+
{
168+
"cell_type": "heading",
169+
"level": 2,
170+
"metadata": {},
171+
"source": [
172+
"Automatically Performing Grid Search"
173+
]
174+
},
175+
{
176+
"cell_type": "markdown",
177+
"metadata": {},
178+
"source": [
179+
"Because searching a grid of hyperparameters is such a common task, scikit-learn provides\n",
180+
"several hyperparameter search estimators to automate this. We'll explore this in more depth\n",
181+
"later in the tutorial, but for now it is interesting to see how ``GridSearchCV`` works:"
182+
]
183+
},
80184
{
81185
"cell_type": "code",
82186
"collapsed": false,
83187
"input": [
84-
"alphas = 10 ** np.linspace(-3, -1, 20)\n",
85-
"scores_L = [cross_val_score(Lasso(alpha), X, y)\n",
86-
" for alpha in alphas]\n",
87-
"scores_R = [cross_val_score(Ridge(alpha), X, y)\n",
88-
" for alpha in alphas]\n",
89-
"plt.plot(alphas, np.mean(scores_R, 1), label=\"Ridge\")\n",
90-
"plt.plot(alphas, np.mean(scores_L, 1), label=\"Lasso\")\n",
91-
"plt.xlabel('alpha')\n",
92-
"plt.ylabel('score')\n",
93-
"plt.legend()"
188+
"from sklearn.grid_search import GridSearchCV"
94189
],
95190
"language": "python",
96191
"metadata": {},
97192
"outputs": []
98193
},
194+
{
195+
"cell_type": "markdown",
196+
"metadata": {},
197+
"source": [
198+
"``GridSearchCV`` is constructed with an estimator, as well as a dictionary\n",
199+
"of parameter values to be searched. We can find the optimal parameters this\n",
200+
"way:"
201+
]
202+
},
99203
{
100204
"cell_type": "code",
101205
"collapsed": false,
102206
"input": [
103-
"from sklearn.linear_model import LassoCV"
207+
"for Model in [Ridge, Lasso]:\n",
208+
" gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)\n",
209+
" print Model.__name__, gscv.best_params_"
104210
],
105211
"language": "python",
106212
"metadata": {},
107213
"outputs": []
108214
},
215+
{
216+
"cell_type": "heading",
217+
"level": 2,
218+
"metadata": {},
219+
"source": [
220+
"Built-in Hyperparameter Search"
221+
]
222+
},
223+
{
224+
"cell_type": "markdown",
225+
"metadata": {},
226+
"source": [
227+
"For some models within scikit-learn, cross-validation can be performed more efficiently\n",
228+
"on large datasets. In this case, a cross-validated version of the particular model is\n",
229+
"included. The cross-validated versions of ``Ridge`` and ``Lasso`` are ``RidgeCV`` and\n",
230+
"``LassoCV``, respectively. The grid search on these estimators can be performed as\n",
231+
"follows:"
232+
]
233+
},
109234
{
110235
"cell_type": "code",
111236
"collapsed": false,
112237
"input": [
113-
"clf = LassoCV(alphas=alphas, cv=3).fit(X, y)\n",
114-
"print \"Lasso:\", clf.alpha_\n",
115-
"clf = RidgeCV(alphas=alphas, cv=3).fit(X, y)\n",
116-
"print \"Ridge:\", clf.alpha_"
238+
"from sklearn.linear_model import RidgeCV, LassoCV\n",
239+
"for Model in [RidgeCV, LassoCV]:\n",
240+
" model = Model(alphas=alphas, cv=3).fit(X, y)\n",
241+
" print Model.__name__, model.alpha_"
117242
],
118243
"language": "python",
119244
"metadata": {},
120245
"outputs": []
121246
},
247+
{
248+
"cell_type": "markdown",
249+
"metadata": {},
250+
"source": [
251+
"We see that the results match those returned by ``GridSearchCV``."
252+
]
253+
},
254+
{
255+
"cell_type": "heading",
256+
"level": 2,
257+
"metadata": {},
258+
"source": [
259+
"Exercise: Learning Curves"
260+
]
261+
},
262+
{
263+
"cell_type": "markdown",
264+
"metadata": {},
265+
"source": [
266+
"Here we'll apply our learning curves to the diabetes data. The questions to answer are these:\n",
267+
"\n",
268+
"- Given the optimal models above, which is over-fitting and which is under-fitting the data?\n",
269+
"- To obtain better results, would you invest time and effort in gathering\n",
270+
" more **training samples**, or gathering more **attributes** for each sample?\n",
271+
" Recall the previous discussion of reading learning curves.\n",
272+
"\n",
273+
"You can follow the process used in the previous notebook to plot the learning curves.\n",
274+
"A good metric to use is the ``mean_squared_error``, which we'll import below:"
275+
]
276+
},
122277
{
123278
"cell_type": "code",
124279
"collapsed": false,
125280
"input": [
126-
"RidgeCV?"
281+
"from sklearn.metrics import mean_squared_error\n",
282+
"# define a function that computes the learning curve (i.e. mean_squared_error as a function\n",
283+
"# of training set size, for both training and test sets) and plot the result\n"
127284
],
128285
"language": "python",
129286
"metadata": {},
130287
"outputs": []
131288
},
289+
{
290+
"cell_type": "heading",
291+
"level": 3,
292+
"metadata": {},
293+
"source": [
294+
"Solution"
295+
]
296+
},
132297
{
133298
"cell_type": "code",
134299
"collapsed": false,
135-
"input": [],
300+
"input": [
301+
"%load solutions/06B_learning_curves.py"
302+
],
136303
"language": "python",
137304
"metadata": {},
138305
"outputs": []
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+for Model in [Lasso, Ridge]:
+    scores = [cross_val_score(Model(alpha), X, y, cv=3).mean()
+              for alpha in alphas]
+    plt.plot(alphas, scores, label=Model.__name__)
+plt.legend(loc='lower left')
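The solution above depends on names already defined in the notebook session (`X`, `y`, `alphas`, `np`, `plt`) and on the `sklearn.cross_validation` module, which newer scikit-learn releases replaced with `sklearn.model_selection`. A minimal self-contained sketch of the same search, assuming Python 3 and a recent scikit-learn; the helper name `mean_cv_scores` is hypothetical, and the plot is left out so the snippet runs headless:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, -1, 30)  # same grid as the notebook cell


def mean_cv_scores(Model, alphas, X, y, cv=3):
    """Mean cross-validated score (R^2 for regressors) for each alpha."""
    return np.array([cross_val_score(Model(alpha=alpha), X, y, cv=cv).mean()
                     for alpha in alphas])


scores = {}
grid_best = {}
for Model in (Lasso, Ridge):
    name = Model.__name__
    scores[name] = mean_cv_scores(Model, alphas, X, y)
    # GridSearchCV runs an equivalent search; shown here for comparison only.
    gscv = GridSearchCV(Model(), {"alpha": alphas}, cv=3).fit(X, y)
    grid_best[name] = gscv.best_params_["alpha"]
    print(name, "manual best alpha:", alphas[scores[name].argmax()],
          "GridSearchCV best alpha:", grid_best[name])
```

Passing `scores[name]` to `plt.plot(alphas, ...)` for each model would reproduce the figure drawn by the solution file.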
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+from sklearn.metrics import explained_variance_score, mean_squared_error
+from sklearn.cross_validation import train_test_split
+
+def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None):
+    sizes = np.linspace(5, N, n_sizes).astype(int)
+    train_err = np.zeros((n_runs, n_sizes))
+    validation_err = np.zeros((n_runs, n_sizes))
+    for i in range(n_runs):
+        for j, size in enumerate(sizes):
+            xtrain, xtest, ytrain, ytest = train_test_split(
+                X, y, train_size=size, random_state=i)
+            # Train on a random subset of `size` points
+            model.fit(xtrain, ytrain)
+            validation_err[i, j] = err_func(ytest, model.predict(xtest))
+            train_err[i, j] = err_func(ytrain, model.predict(xtrain))
+
+    plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation')
+    plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training')
+
+    plt.xlabel('training set size')
+    plt.ylabel(err_func.__name__.replace('_', ' '))
+
+    plt.grid(True)
+
+    plt.legend(loc=0)
+
+    plt.xlim(0, N-1)
+
+    if ylim:
+        plt.ylim(ylim)
+
+
+plt.figure(figsize=(10, 8))
+for i, model in enumerate([Lasso(0.01), Ridge(0.06)]):
+    plt.subplot(221 + i)
+    plot_learning_curve(model, ylim=(0, 1))
+    plt.title(model.__class__.__name__)
+
+    plt.subplot(223 + i)
+    plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000))
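scikit-learn releases newer than the one this commit targets ship a `learning_curve` helper in `sklearn.model_selection` that performs a similar resize-and-refit loop. A rough sketch under that assumption, reusing `Ridge(0.06)` from the script above; the size grid and `cv=3` are illustrative choices, not values taken from the commit:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = load_diabetes(return_X_y=True)

# Refit the model on growing training subsets; each size is scored on
# both the training folds and the held-out fold of a 3-fold split.
sizes, train_scores, valid_scores = learning_curve(
    Ridge(alpha=0.06), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
    scoring="neg_mean_squared_error",
)

# The scorer returns negated MSE, so flip the sign before averaging over folds.
train_mse = -train_scores.mean(axis=1)
valid_mse = -valid_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, valid_mse):
    print(n, "training MSE:", round(tr), "validation MSE:", round(va))
```

Unlike the hand-written loop, this averages over cross-validation folds rather than over repeated random splits, so the curves will not match the solution's plots point for point.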
