Commit aaeecc0
Update notebooks
1 parent c254e90 commit aaeecc0

3 files changed: +98 -42 lines changed

notebooks/03_categorical_pipeline_sol_02.ipynb

Lines changed: 1 addition & 1 deletion
@@ -250,7 +250,7 @@
 "<div class=\"admonition important alert alert-info\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Important</p>\n",
 "<p>Which encoder should I use?</p>\n",
-"<table border=\"1\" class=\"colwidths-auto docutils\">\n",
+"<table border=\"1\" class=\"docutils\">\n",
 "<thead valign=\"bottom\">\n",
 "<tr><th class=\"head\"></th>\n",
 "<th class=\"head\">Meaningful order</th>\n",

notebooks/03_categorical_pipeline_visualization.ipynb

Lines changed: 4 additions & 4 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# How to define a scikit-learn pipeline and visualize it"
+"# Visualizing scikit-learn pipelines in Jupyter"
 ]
 },
 {
@@ -22,7 +22,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### First we load the dataset"
+"## First we load the dataset"
 ]
 },
 {
@@ -86,7 +86,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Then we create the pipeline"
+"## Then we create the pipeline"
 ]
 },
 {
@@ -176,7 +176,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Finally we score the model"
+"## Finally we score the model"
 ]
 },
 {
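For context, the notebook edited above displays pipelines as HTML diagrams in Jupyter. A minimal sketch of that pattern, with illustrative column names and estimators that are not taken from the notebook itself:

```python
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Render estimators as HTML diagrams instead of plain text.
set_config(display="diagram")

# Hypothetical preprocessing: scale numeric columns, one-hot encode the rest.
preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "hours-per-week"]),
    (OneHotEncoder(handle_unknown="ignore"), ["workclass", "education"]),
)
model = make_pipeline(preprocessor, LogisticRegression())
model  # as the last expression of a notebook cell, this renders the diagram
```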

notebooks/ensemble_hyperparameters.ipynb

Lines changed: 93 additions & 37 deletions
@@ -17,28 +17,12 @@
 "<div class=\"admonition caution alert alert-warning\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
 "<p class=\"last\">For the sake of clarity, no cross-validation will be used to estimate the\n",
-"testing error. We are only showing the effect of the parameters\n",
-"on the validation set of what should be the inner cross-validation.</p>\n",
+"variability of the testing error. We are only showing the effect of the\n",
+"parameters on the validation set of what should be the inner loop of a nested\n",
+"cross-validation.</p>\n",
 "</div>\n",
 "\n",
-"## Random forest\n",
-"\n",
-"The main parameter to tune for random forest is the `n_estimators` parameter.\n",
-"In general, the more trees in the forest, the better the generalization\n",
-"performance will be. However, it will slow down the fitting and prediction\n",
-"time. The goal is to balance computing time and generalization performance when\n",
-"setting the number of estimators when putting such learner in production.\n",
-"\n",
-"Then, we could also tune a parameter that controls the depth of each tree in\n",
-"the forest. Two parameters are important for this: `max_depth` and\n",
-"`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
-"Indeed, `max_depth` will enforce to have a more symmetric tree, while\n",
-"`max_leaf_nodes` does not impose such constraint.\n",
-"\n",
-"Be aware that with random forest, trees are generally deep since we are\n",
-"seeking to overfit each tree on each bootstrap sample because this will be\n",
-"mitigated by combining them altogether. Assembling underfitted trees (i.e.\n",
-"shallow trees) might also lead to an underfitted forest."
+"We will start by loading the California housing dataset."
 ]
 },
 {
@@ -56,6 +40,71 @@
 " data, target, random_state=0)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Random forest\n",
+"\n",
+"The main parameter to select in a random forest is the `n_estimators` parameter.\n",
+"In general, the more trees in the forest, the better the generalization\n",
+"performance will be. However, it will slow down the fitting and prediction\n",
+"time. The goal is to balance computing time and generalization performance\n",
+"when setting the number of estimators. Here, we fix `n_estimators=100`, which\n",
+"is already the default value.\n",
+"\n",
+"<div class=\"admonition caution alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
+"<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally results in a waste of\n",
+"computing power. We just need to ensure that it is large enough so that doubling\n",
+"its value does not lead to a significant improvement of the validation error.</p>\n",
+"</div>\n",
+"\n",
+"Instead, we can tune the hyperparameter `max_features`, which controls the\n",
+"size of the random subset of features to consider when looking for the best\n",
+"split when growing the trees: smaller values for `max_features` will lead to\n",
+"more random trees with hopefully more uncorrelated prediction errors. However,\n",
+"if `max_features` is too small, predictions can be too random, even after\n",
+"averaging over the trees in the ensemble.\n",
+"\n",
+"If `max_features` is set to `None`, then this is equivalent to setting\n",
+"`max_features=n_features`, which means that the only source of randomness in\n",
+"the random forest is the bagging procedure."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"print(f\"In this case, n_features={len(data.columns)}\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can also tune the different parameters that control the depth of each tree\n",
+"in the forest. Two parameters are important for this: `max_depth` and\n",
+"`max_leaf_nodes`. They differ in the way they control the tree structure.\n",
+"Indeed, `max_depth` will enforce a more symmetric tree, while\n",
+"`max_leaf_nodes` does not impose such a constraint. If `max_leaf_nodes=None`\n",
+"then the number of leaf nodes is unlimited.\n",
+"\n",
+"The hyperparameter `min_samples_leaf` controls the minimum number of samples\n",
+"required to be at a leaf node. This means that a split point (at any depth) is\n",
+"only done if it leaves at least `min_samples_leaf` training samples in each of\n",
+"the left and right branches. A small value for `min_samples_leaf` means that\n",
+"some samples can become isolated when a tree is deep, promoting overfitting. A\n",
+"large value would prevent deep trees, which can lead to underfitting.\n",
+"\n",
+"Be aware that with random forests, trees are expected to be deep since we are\n",
+"seeking to overfit each tree on each bootstrap sample. Overfitting is\n",
+"mitigated when combining the trees altogether, whereas assembling underfitted\n",
+"trees (i.e. shallow trees) might also lead to an underfitted forest."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -67,8 +116,9 @@
 "from sklearn.ensemble import RandomForestRegressor\n",
 "\n",
 "param_distributions = {\n",
-" \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n",
-" \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n",
+" \"max_features\": [1, 2, 3, 5, None],\n",
+" \"max_leaf_nodes\": [10, 100, 1000, None],\n",
+" \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n",
 "}\n",
 "search_cv = RandomizedSearchCV(\n",
 " RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n",
@@ -88,15 +138,21 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can observe in our search that we are required to have a large\n",
-"number of leaves and thus deep trees. This parameter seems particularly\n",
-"impactful in comparison to the number of trees for this particular dataset:\n",
-"with at least 50 trees, the generalization performance will be driven by the\n",
-"number of leaves.\n",
-"\n",
-"Now we will estimate the generalization performance of the best model by\n",
-"refitting it with the full training set and using the test set for scoring on\n",
-"unseen data. This is done by default when calling the `.fit` method."
+"We can observe in our search that we are required to have a large value of\n",
+"`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n",
+"impactful with respect to the other tuning parameters, but large values of\n",
+"`min_samples_leaf` seem to reduce the performance of the model.\n",
+"\n",
+"In practice, more iterations of random search would be necessary to precisely\n",
+"assess the role of each parameter. Using `n_iter=10` is good enough to\n",
+"quickly inspect the hyperparameter combinations that yield models that work\n",
+"well enough without spending too many computational resources. Feel free to\n",
+"try more iterations on your own.\n",
+"\n",
+"Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n",
+"uses them to refit the model using the full training set. To estimate the\n",
+"generalization performance of the best model, it suffices to call `.score` on\n",
+"the unseen data."
 ]
 },
 {
@@ -180,8 +236,8 @@
 "\n",
 "<div class=\"admonition caution alert alert-warning\">\n",
 "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
-"<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that using early-stopping as\n",
-"in the previous exercise will be better.</p>\n",
+"<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that it is better to use\n",
+"<tt class=\"docutils literal\">early_stopping</tt> as done in Exercise M6.04.</p>\n",
 "</div>\n",
 "\n",
 "In this search, we see that the `learning_rate` is required to be large\n",
@@ -196,8 +252,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we estimate the generalization performance of the best model\n",
-"using the test set."
+"Now we estimate the generalization performance of the best model using the\n",
+"test set."
 ]
 },
 {
@@ -216,8 +272,8 @@
 "source": [
 "The mean test score in the held-out test set is slightly better than the score\n",
 "of the best model. The reason is that the final model is refitted on the whole\n",
-"training set and therefore, on more data than the inner cross-validated models\n",
-"of the grid search procedure."
+"training set and therefore, on more data than the cross-validated models of\n",
+"the grid search procedure."
 ]
 }
 ],
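The refitting behavior described in the last cell can be checked directly on the fitted search object: `best_score_` holds the mean cross-validated score of the best candidate, while `.score` evaluates the refitted model on the held-out data (names assumed from the earlier cells):

```python
# Mean cross-validated score of the best parameter combination.
print(f"Best inner CV score: {search_cv.best_score_:.3f}")
# Score of the final model, refitted on the whole training set.
print(f"Held-out test score: {search_cv.score(data_test, target_test):.3f}")
```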
