|
17 | 17 | "<div class=\"admonition caution alert alert-warning\">\n",
|
18 | 18 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
|
19 | 19 | "<p class=\"last\">For the sake of clarity, no cross-validation will be used to estimate the\n",
|
20 |  | - "testing error. We are only showing the effect of the parameters\n", |
21 |  | - "on the validation set of what should be the inner cross-validation.</p>\n", |
| 20 | + "variability of the testing error. We are only showing the effect of the\n", |
| 21 | + "parameters on the validation set of what should be the inner loop of a nested\n", |
| 22 | + "cross-validation.</p>\n", |
22 | 23 | "</div>\n",
|
23 | 24 | "\n",
|
24 |  | - "## Random forest\n", |
25 |  | - "\n", |
26 |  | - "The main parameter to tune for random forest is the `n_estimators` parameter.\n", |
27 |  | - "In general, the more trees in the forest, the better the generalization\n", |
28 |  | - "performance will be. However, it will slow down the fitting and prediction\n", |
29 |  | - "time. The goal is to balance computing time and generalization performance when\n", |
30 |  | - "setting the number of estimators when putting such learner in production.\n", |
31 |  | - "\n", |
32 |  | - "Then, we could also tune a parameter that controls the depth of each tree in\n", |
33 |  | - "the forest. Two parameters are important for this: `max_depth` and\n", |
34 |  | - "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
35 |  | - "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
36 |  | - "`max_leaf_nodes` does not impose such constraint.\n", |
37 |  | - "\n", |
38 |  | - "Be aware that with random forest, trees are generally deep since we are\n", |
39 |  | - "seeking to overfit each tree on each bootstrap sample because this will be\n", |
40 |  | - "mitigated by combining them altogether. Assembling underfitted trees (i.e.\n", |
41 |  | - "shallow trees) might also lead to an underfitted forest." |
| 25 | + "We will start by loading the california housing dataset." |
42 | 26 | ]
|
43 | 27 | },
|
44 | 28 | {
|
|
56 | 40 | " data, target, random_state=0)"
|
57 | 41 | ]
|
58 | 42 | },
|
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "## Random forest\n", |
| 48 | + "\n", |
| 49 | + "The main parameter to select in random forest is the `n_estimators` parameter.\n", |
| 50 | + "In general, the more trees in the forest, the better the generalization\n", |
| 51 | + "performance will be. However, it will slow down the fitting and prediction\n", |
| 52 | + "time. The goal is to balance computing time and generalization performance\n", |
| 53 | + "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n", |
| 54 | + "is already the default value.\n", |
| 55 | + "\n", |
| 56 | + "<div class=\"admonition caution alert alert-warning\">\n", |
| 57 | + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n", |
| 58 | + "<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally result in a waste of\n", |
| 59 | + "computer power. We just need to ensure that it is large enough so that doubling\n", |
| 60 | + "its value does not lead to a significant improvement of the validation error.</p>\n", |
| 61 | + "</div>\n", |
| 62 | + "\n", |
| 63 | + "Instead, we can tune the hyperparameter `max_features`, which controls the\n", |
| 64 | + "size of the random subset of features to consider when looking for the best\n", |
| 65 | + "split when growing the trees: smaller values for `max_features` will lead to\n", |
| 66 | + "more random trees with hopefully more uncorrelated prediction errors. However\n", |
| 67 | + "if `max_features` is too small, predictions can be too random, even after\n", |
| 68 | + "averaging with the trees in the ensemble.\n", |
| 69 | + "\n", |
| 70 | + "If `max_features` is set to `None`, then this is equivalent to setting\n", |
| 71 | + "`max_features=n_features` which means that the only source of randomness in\n", |
| 72 | + "the random forest is the bagging procedure." |
| 73 | + ] |
| 74 | + }, |
| 75 | + { |
| 76 | + "cell_type": "code", |
| 77 | + "execution_count": null, |
| 78 | + "metadata": {}, |
| 79 | + "outputs": [], |
| 80 | + "source": [ |
| 81 | + "print(f\"In this case, n_features={len(data.columns)}\")" |
| 82 | + ] |
| 83 | + }, |
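
A quick aside (an addition, not part of the original notebook): the effect of `max_features` on the forest's randomness can be sketched with a small cross-validation. This is a hedged illustration; it reloads the California housing data so the snippet does not depend on variables defined elsewhere in the notebook.

```python
# Sketch: compare a very small `max_features` with the unrestricted default.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data, target = fetch_california_housing(return_X_y=True, as_frame=True)

for max_features in (1, None):
    forest = RandomForestRegressor(
        n_estimators=100, max_features=max_features, n_jobs=2, random_state=0
    )
    # Default scoring for a regressor is the R2 coefficient of determination.
    scores = cross_val_score(forest, data, target, cv=3)
    print(f"max_features={max_features}: "
          f"R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```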
| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "metadata": {}, |
| 87 | + "source": [ |
| 88 | + "We can also tune the different parameters that control the depth of each tree\n", |
| 89 | + "in the forest. Two parameters are important for this: `max_depth` and\n", |
| 90 | + "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
| 91 | + "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
| 92 | + "`max_leaf_nodes` does not impose such constraint. If `max_leaf_nodes=None`\n", |
| 93 | + "then the number of leaf nodes is unlimited.\n", |
| 94 | + "\n", |
| 95 | + "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n", |
| 96 | + "required to be at a leaf node. This means that a split point (at any depth) is\n", |
| 97 | + "only done if it leaves at least `min_samples_leaf` training samples in each of\n", |
| 98 | + "the left and right branches. A small value for `min_samples_leaf` means that\n", |
| 99 | + "some samples can become isolated when a tree is deep, promoting overfitting. A\n", |
| 100 | + "large value would prevent deep trees, which can lead to underfitting.\n", |
| 101 | + "\n", |
| 102 | + "Be aware that with random forest, trees are expected to be deep since we are\n", |
| 103 | + "seeking to overfit each tree on each bootstrap sample. Overfitting is\n", |
| 104 | + "mitigated when combining the trees altogether, whereas assembling underfitted\n", |
| 105 | + "trees (i.e. shallow trees) might also lead to an underfitted forest." |
| 106 | + ] |
| 107 | + }, |
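
To make the depth-control discussion concrete, here is a hedged sketch (an addition, not the notebook's code) that grows single decision trees under different `max_leaf_nodes` constraints and reports the resulting depth.

```python
# Sketch: how `max_leaf_nodes` constrains the depth of a single tree.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

data, target = fetch_california_housing(return_X_y=True)

for max_leaf_nodes in (10, 100, None):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(data, target)
    print(f"max_leaf_nodes={max_leaf_nodes}: depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}")
```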
59 | 108 | {
|
60 | 109 | "cell_type": "code",
|
61 | 110 | "execution_count": null,
|
|
67 | 116 | "from sklearn.ensemble import RandomForestRegressor\n",
|
68 | 117 | "\n",
|
69 | 118 | "param_distributions = {\n",
|
70 |  | - " \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n", |
71 |  | - " \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n", |
| 119 | + " \"max_features\": [1, 2, 3, 5, None],\n", |
| 120 | + " \"max_leaf_nodes\": [10, 100, 1000, None],\n", |
| 121 | + " \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n", |
72 | 122 | "}\n",
|
73 | 123 | "search_cv = RandomizedSearchCV(\n",
|
74 | 124 | " RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n",
|
|
88 | 138 | "cell_type": "markdown",
|
89 | 139 | "metadata": {},
|
90 | 140 | "source": [
|
91 |  | - "We can observe in our search that we are required to have a large\n", |
92 |  | - "number of leaves and thus deep trees. This parameter seems particularly\n", |
93 |  | - "impactful in comparison to the number of trees for this particular dataset:\n", |
94 |  | - "with at least 50 trees, the generalization performance will be driven by the\n", |
95 |  | - "number of leaves.\n", |
96 |  | - "\n", |
97 |  | - "Now we will estimate the generalization performance of the best model by\n", |
98 |  | - "refitting it with the full training set and using the test set for scoring on\n", |
99 |  | - "unseen data. This is done by default when calling the `.fit` method." |
| 141 | + "We can observe in our search that we are required to have a large number of\n", |
| 142 | + "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n", |
| 143 | + "impactful with respect to the other tuning parameters, but large values of\n", |
| 144 | + "`min_samples_leaf` seem to reduce the performance of the model.\n", |
| 145 | + "\n", |
| 146 | + "In practice, more iterations of random search would be necessary to precisely\n", |
| 147 | + "assert the role of each parameters. Using `n_iter=10` is good enough to\n", |
| 148 | + "quickly inspect the hyperparameter combinations that yield models that work\n", |
| 149 | + "well enough without spending too much computational resources. Feel free to\n", |
| 150 | + "try more interations on your own.\n", |
| 151 | + "\n", |
| 152 | + "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n", |
| 153 | + "uses them to refit the model using the full training set. To estimate the\n", |
| 154 | + "generalization performance of the best model it suffices to call `.score` on\n", |
| 155 | + "the unseen data." |
100 | 156 | ]
|
101 | 157 | },
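
One way to back the claims above is to look at `cv_results_` directly. The following is a minimal sketch (assuming the fitted `search_cv` from the cell above); the column names are the standard ones produced by `RandomizedSearchCV`.

```python
# Sketch: rank the sampled hyperparameter combinations by mean test score.
import pandas as pd

cv_results = pd.DataFrame(search_cv.cv_results_)
columns = [c for c in cv_results.columns if c.startswith("param_")]
columns += ["mean_test_score", "std_test_score", "rank_test_score"]
print(cv_results[columns].sort_values("rank_test_score").head())
```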
|
102 | 158 | {
|
|
180 | 236 | "\n",
|
181 | 237 | "<div class=\"admonition caution alert alert-warning\">\n",
|
182 | 238 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
|
183 |  | - "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that using early-stopping as\n", |
184 |  | - "in the previous exercise will be better.</p>\n", |
| 239 | + "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that is better to use\n", |
| 240 | + "<tt class=\"docutils literal\">early_stopping</tt> as done in the Exercise M6.04.</p>\n", |
185 | 241 | "</div>\n",
|
186 | 242 | "\n",
|
187 | 243 | "In this search, we see that the `learning_rate` is required to be large\n",
|
|
196 | 252 | "cell_type": "markdown",
|
197 | 253 | "metadata": {},
|
198 | 254 | "source": [
|
199 |  | - "Now we estimate the generalization performance of the best model\n", |
200 |  | - "using the test set." |
| 255 | + "Now we estimate the generalization performance of the best model using the\n", |
| 256 | + "test set." |
201 | 257 | ]
|
202 | 258 | },
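
The caution box above refers to early stopping. As a hedged sketch of that alternative (the exercise's exact code is not shown here, so this is an assumption about the approach), one can let `HistGradientBoostingRegressor` choose the number of boosting iterations itself:

```python
# Sketch: early stopping instead of tuning the number of boosting iterations.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor

data, target = fetch_california_housing(return_X_y=True)

hgbr = HistGradientBoostingRegressor(
    max_iter=1000, early_stopping=True, validation_fraction=0.2,
    n_iter_no_change=5, random_state=0,
)
hgbr.fit(data, target)
print(f"Boosting iterations actually used: {hgbr.n_iter_}")
```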
|
203 | 259 | {
|
|
216 | 272 | "source": [
|
217 | 273 | "The mean test score in the held-out test set is slightly better than the score\n",
|
218 | 274 | "of the best model. The reason is that the final model is refitted on the whole\n",
|
219 |  | - "training set and therefore, on more data than the inner cross-validated models\n", |
220 |  | - "of the grid search procedure." |
| 275 | + "training set and therefore, on more data than the cross-validated models of\n", |
| 276 | + "the grid search procedure." |
221 | 277 | ]
|
222 | 278 | }
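
A small check of the refit behaviour described above (assuming the fitted `search_cv`): with the default `refit=True`, `best_estimator_` is the model retrained on the whole training set, and `.score` delegates to it.

```python
# Sketch: inspect the refitted best model behind `search_cv.score(...)`.
print(search_cv.best_params_)
print(search_cv.best_estimator_)
```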
|
223 | 279 | ],
|
|