#
# ```{caution}
# For the sake of clarity, no cross-validation will be used to estimate the
# variability of the testing error. We are only showing the effect of the
# parameters on the validation set of what should be the inner loop of a nested
# cross-validation.
# ```
#
# We will start by loading the california housing dataset.

# %%
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# load the dataset as a dataframe so that the feature names remain accessible
data, target = fetch_california_housing(return_X_y=True, as_frame=True)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

# %% [markdown]
# ## Random forest
#
# The main parameter to select in a random forest is `n_estimators`, the number
# of trees. In general, the more trees in the forest, the better the
# generalization performance will be. However, more trees also slow down the
# fitting and prediction time. The goal is to balance computing time and
# generalization performance when setting the number of estimators. Here, we fix
# `n_estimators=100`, which is already the default value.
#
# ```{caution}
# Tuning `n_estimators` for random forests generally results in a waste of
# computing power. We just need to ensure that it is large enough so that
# doubling its value does not lead to a significant improvement of the
# validation error.
# ```
#
# Instead, we can tune the hyperparameter `max_features`, which controls the
# size of the random subset of features to consider when looking for the best
# split when growing the trees: smaller values for `max_features` lead to more
# random trees, with hopefully more uncorrelated prediction errors. However, if
# `max_features` is too small, predictions can be too random, even after
# averaging over the trees in the ensemble.
#
# If `max_features` is set to `None`, then this is equivalent to setting
# `max_features=n_features`, which means that the only source of randomness in
# the random forest is the bagging procedure.

# %%
print(f"In this case, n_features={len(data.columns)}")

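# %% [markdown]
# As a quick illustration of the caution above, a minimal sketch of such a check
# could compare the cross-validated error obtained with 100 and 200 trees: if
# doubling `n_estimators` barely changes the error, then 100 trees is already
# large enough.

# %%
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

for n_estimators in (100, 200):
    # illustrative check only: the choice of cv=3 folds here is arbitrary;
    # negate the negative mean absolute error to report an error to minimize
    errors = -cross_val_score(
        RandomForestRegressor(n_estimators=n_estimators, n_jobs=2),
        data_train, target_train,
        scoring="neg_mean_absolute_error", cv=3,
    )
    print(f"n_estimators={n_estimators}: "
          f"mean absolute error = {errors.mean():.3f} +/- {errors.std():.3f}")
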
# %% [markdown]
# We can also tune the different parameters that control the depth of each tree
# in the forest. Two parameters are important for this: `max_depth` and
# `max_leaf_nodes`. They differ in the way they control the tree structure.
# Indeed, `max_depth` enforces growing a more symmetric tree, while
# `max_leaf_nodes` does not impose such a constraint. If `max_leaf_nodes=None`,
# then the number of leaf nodes is unlimited.
#
# The hyperparameter `min_samples_leaf` controls the minimum number of samples
# required to be at a leaf node. This means that a split point (at any depth) is
# only performed if it leaves at least `min_samples_leaf` training samples in
# each of the left and right branches. A small value for `min_samples_leaf`
# means that some samples can become isolated when a tree is deep, promoting
# overfitting. A large value would prevent deep trees, which can lead to
# underfitting.
#
# Be aware that with random forest, trees are expected to be deep since we are
# seeking to overfit each tree on each bootstrap sample. Overfitting is
# mitigated when combining the trees altogether, whereas assembling underfitted
# trees (i.e. shallow trees) might lead to an underfitted forest.

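# %% [markdown]
# As a small illustration of this last point, we can fit a forest with default
# parameters (a quick sketch, not required for the rest of the analysis) and
# look at how deep its fully grown trees actually are.

# %%
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_jobs=2, random_state=0)
forest.fit(data_train, target_train)
# each fitted tree exposes its depth through `get_depth()`
depths = [tree.get_depth() for tree in forest.estimators_]
print(f"Tree depths range from {min(depths)} to {max(depths)}")
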
# %%
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_distributions = {
    "max_features": [1, 2, 3, 5, None],
    "max_leaf_nodes": [10, 100, 1000, None],
    "min_samples_leaf": [1, 2, 5, 10, 20, 50, 100],
}
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
    n_iter=10, scoring="neg_mean_absolute_error", random_state=0,
)
search_cv.fit(data_train, target_train)

# gather the search results and report the score as a positive error
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
columns = [f"param_{name}" for name in param_distributions] + ["mean_test_error"]
cv_results[columns].sort_values(by="mean_test_error")

# %% [markdown]
# We can observe in our search that we are required to have a large value of
# `max_leaf_nodes` and thus deep trees. This parameter seems particularly
# impactful with respect to the other tuning parameters, but large values of
# `min_samples_leaf` seem to reduce the performance of the model.
#
# In practice, more iterations of random search would be necessary to precisely
# assess the role of each parameter. Using `n_iter=10` is good enough to quickly
# inspect the hyperparameter combinations that yield models that work well
# enough without spending too many computational resources. Feel free to try
# more iterations on your own.
#
# Once the `RandomizedSearchCV` has found the best set of hyperparameters, it
# uses them to refit the model using the full training set. To estimate the
# generalization performance of the best model, it suffices to call `.score` on
# the unseen data.

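# %% [markdown]
# Although not needed for what follows, the retained combination of
# hyperparameters can for instance be inspected through the `best_params_`
# attribute of the fitted search.

# %%
# best hyperparameter combination found by the random search
print(search_cv.best_params_)
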
# %%
error = -search_cv.score(data_test, target_test)

# %% [markdown]
#
# ```{caution}
# Here, we tune the `n_estimators` but be aware that it is better to use
# `early_stopping`, as done in Exercise M6.04.
# ```
#
# In this search, we see that the `learning_rate` is required to be large
# enough. It is however difficult to draw more detailed conclusions, since the
# best value of each hyperparameter depends on the other hyperparameter values.

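# %% [markdown]
# As mentioned in the caution above, a better practice is to rely on early
# stopping rather than tuning the number of boosting iterations directly. The
# following cell is only a minimal sketch of what this could look like (the
# exact setup used in Exercise M6.04 may differ): we set `max_iter` to a large
# value and let the model stop adding trees once the score on an internal
# validation split stops improving.

# %%
from sklearn.ensemble import HistGradientBoostingRegressor

hgbt = HistGradientBoostingRegressor(
    max_iter=1_000, early_stopping=True, validation_fraction=0.1,
    n_iter_no_change=5, random_state=0,
)
hgbt.fit(data_train, target_train)
# `n_iter_` stores the number of boosting iterations actually performed
print(f"Number of boosting iterations used: {hgbt.n_iter_}")
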
# %% [markdown]
# Now we estimate the generalization performance of the best model using the
# test set.

# %%
error = -search_cv.score(data_test, target_test)

# %% [markdown]
# The mean test score on the held-out test set is slightly better than the score
# of the best model. The reason is that the final model is refitted on the whole
# training set and therefore on more data than the cross-validated models of
# the random search procedure.