Commit c254e90

Emphasize good practice of not tuning n_estimators (INRIA#658)
1 parent 097dea8 commit c254e90

File tree

1 file changed

python_scripts/ensemble_hyperparameters.py (77 additions & 36 deletions)
@@ -18,28 +18,12 @@
 #
 # ```{caution}
 # For the sake of clarity, no cross-validation will be used to estimate the
-# testing error. We are only showing the effect of the parameters
-# on the validation set of what should be the inner cross-validation.
+# variability of the testing error. We are only showing the effect of the
+# parameters on the validation set of what should be the inner loop of a nested
+# cross-validation.
 # ```
 #
-# ## Random forest
-#
-# The main parameter to tune for random forest is the `n_estimators` parameter.
-# In general, the more trees in the forest, the better the generalization
-# performance will be. However, it will slow down the fitting and prediction
-# time. The goal is to balance computing time and generalization performance when
-# setting the number of estimators when putting such learner in production.
-#
-# Then, we could also tune a parameter that controls the depth of each tree in
-# the forest. Two parameters are important for this: `max_depth` and
-# `max_leaf_nodes`. They differ in the way they control the tree structure.
-# Indeed, `max_depth` will enforce to have a more symmetric tree, while
-# `max_leaf_nodes` does not impose such constraint.
-#
-# Be aware that with random forest, trees are generally deep since we are
-# seeking to overfit each tree on each bootstrap sample because this will be
-# mitigated by combining them altogether. Assembling underfitted trees (i.e.
-# shallow trees) might also lead to an underfitted forest.
+# We will start by loading the California housing dataset.
 
 # %%
 from sklearn.datasets import fetch_california_housing
@@ -50,14 +34,65 @@
 data_train, data_test, target_train, target_test = train_test_split(
     data, target, random_state=0)
 
+# %% [markdown]
+# ## Random forest
+#
+# The main parameter to select in a random forest is the `n_estimators`
+# parameter. In general, the more trees in the forest, the better the
+# generalization performance will be. However, it will slow down the fitting
+# and prediction time. The goal is to balance computing time and generalization
+# performance when setting the number of estimators. Here, we fix
+# `n_estimators=100`, which is already the default value.
+#
+# ```{caution}
+# Tuning the `n_estimators` for random forests generally results in a waste of
+# computing power. We just need to ensure that it is large enough so that
+# doubling its value does not lead to a significant improvement of the
+# validation error.
+# ```
+#
+# Instead, we can tune the hyperparameter `max_features`, which controls the
+# size of the random subset of features to consider when looking for the best
+# split when growing the trees: smaller values for `max_features` will lead to
+# more random trees with hopefully more uncorrelated prediction errors. However,
+# if `max_features` is too small, predictions can be too random, even after
+# averaging over the trees in the ensemble.
+#
+# If `max_features` is set to `None`, then this is equivalent to setting
+# `max_features=n_features`, which means that the only source of randomness in
+# the random forest is the bagging procedure.
+
+# %%
+print(f"In this case, n_features={len(data.columns)}")
+
+# %% [markdown]
+# We can also tune the different parameters that control the depth of each tree
+# in the forest. Two parameters are important for this: `max_depth` and
+# `max_leaf_nodes`. They differ in the way they control the tree structure.
+# Indeed, `max_depth` enforces a more symmetric tree, while `max_leaf_nodes`
+# does not impose such a constraint. If `max_leaf_nodes=None`, then the number
+# of leaf nodes is unlimited.
+#
+# The hyperparameter `min_samples_leaf` controls the minimum number of samples
+# required to be at a leaf node. This means that a split point (at any depth) is
+# only made if it leaves at least `min_samples_leaf` training samples in each of
+# the left and right branches. A small value for `min_samples_leaf` means that
+# some samples can become isolated when a tree is deep, promoting overfitting. A
+# large value would prevent deep trees, which can lead to underfitting.
+#
+# Be aware that with random forest, trees are expected to be deep since we are
+# seeking to overfit each tree on each bootstrap sample. Overfitting is
+# mitigated when combining the trees altogether, whereas assembling underfitted
+# trees (i.e. shallow trees) might also lead to an underfitted forest.
+
 # %%
 import pandas as pd
 from sklearn.model_selection import RandomizedSearchCV
 from sklearn.ensemble import RandomForestRegressor
 
 param_distributions = {
-    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
-    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
+    "max_features": [1, 2, 3, 5, None],
+    "max_leaf_nodes": [10, 100, 1000, None],
+    "min_samples_leaf": [1, 2, 5, 10, 20, 50, 100],
 }
 search_cv = RandomizedSearchCV(
     RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
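The new caution recommends a sanity check rather than a search over `n_estimators`: pick a value, double it, and confirm that the validation error barely moves. Below is a minimal sketch of that check (not part of the commit), using a plain train/validation split for brevity instead of the nested cross-validation mentioned in the script:

```python
# Sketch of the "doubling" heuristic from the caution above: n_estimators is
# large enough once doubling it no longer improves the validation score.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
data_train, data_val, target_train, target_val = train_test_split(
    data, target, random_state=0)

for n_estimators in [100, 200, 400]:
    forest = RandomForestRegressor(
        n_estimators=n_estimators, n_jobs=2, random_state=0)
    forest.fit(data_train, target_train)
    score = forest.score(data_val, target_val)
    print(f"n_estimators={n_estimators:>3}: validation R2 = {score:.3f}")
```

If the score is essentially flat between 100 and 200 trees, the default `n_estimators=100` can be kept and the search budget spent on `max_features`, `max_leaf_nodes` and `min_samples_leaf` instead.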
@@ -73,15 +108,21 @@
 cv_results[columns].sort_values(by="mean_test_error")
 
 # %% [markdown]
-# We can observe in our search that we are required to have a large
-# number of leaves and thus deep trees. This parameter seems particularly
-# impactful in comparison to the number of trees for this particular dataset:
-# with at least 50 trees, the generalization performance will be driven by the
-# number of leaves.
+# We can observe in our search that we are required to have a large value of
+# `max_leaf_nodes` and thus deep trees. This parameter seems particularly
+# impactful in comparison with the other tuned parameters, but large values of
+# `min_samples_leaf` seem to reduce the performance of the model.
+#
+# In practice, more iterations of random search would be necessary to precisely
+# assess the role of each parameter. Using `n_iter=10` is good enough to
+# quickly inspect the hyperparameter combinations that yield models that work
+# well enough without spending too many computational resources. Feel free to
+# try more iterations on your own.
 #
-# Now we will estimate the generalization performance of the best model by
-# refitting it with the full training set and using the test set for scoring on
-# unseen data. This is done by default when calling the `.fit` method.
+# Once the `RandomizedSearchCV` has found the best set of hyperparameters, it
+# uses them to refit the model using the full training set. To estimate the
+# generalization performance of the best model, it suffices to call `.score` on
+# the unseen data.
 
 # %%
 error = -search_cv.score(data_test, target_test)
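For readers following the diff, the refit-and-score pattern the new comment describes corresponds roughly to the sketch below. The `scoring` argument is an assumption (the hunk truncates the actual `RandomizedSearchCV` call), chosen as a `neg_*` metric so that negating `.score` yields an error, consistent with the `error = -search_cv.score(...)` line above:

```python
# Hedged sketch of the refit-and-score pattern; the scoring choice is assumed,
# since the actual RandomizedSearchCV arguments are truncated by the hunk.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

param_distributions = {
    "max_features": [1, 2, 3, 5, None],
    "max_leaf_nodes": [10, 100, 1000, None],
    "min_samples_leaf": [1, 2, 5, 10, 20, 50, 100],
}
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2),
    param_distributions=param_distributions,
    scoring="neg_mean_absolute_error",  # assumed, not shown in the hunk
    n_iter=10,
    random_state=0,
)
# refit=True (the default) refits the best candidate on the full training set.
search_cv.fit(data_train, target_train)
print("Best hyperparameters:", search_cv.best_params_)

# .score delegates to the refit best_estimator_ using the same scorer; negating
# it turns the "neg_" score back into an error.
error = -search_cv.score(data_test, target_test)
print(f"Mean absolute error on the test set: {error:.3f}")
```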
@@ -144,8 +185,8 @@
 # %% [markdown]
 #
 # ```{caution}
-# Here, we tune the `n_estimators` but be aware that using early-stopping as
-# in the previous exercise will be better.
+# Here, we tune the `n_estimators` but be aware that it is better to use
+# `early_stopping`, as done in Exercise M6.04.
 # ```
 #
 # In this search, we see that the `learning_rate` is required to be large
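The caution points to Exercise M6.04 for early stopping. As a hedged illustration (the exercise's actual solution may differ), scikit-learn's `HistGradientBoostingRegressor` exposes the mechanism directly, letting the effective number of trees be chosen from an internal validation split instead of being tuned:

```python
# Sketch of early stopping as an alternative to tuning n_estimators directly.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

hgbr = HistGradientBoostingRegressor(
    max_iter=1_000,           # generous upper bound on the number of trees
    early_stopping=True,      # monitor an internal validation set
    validation_fraction=0.1,  # fraction of training data held out for it
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    random_state=0,
)
hgbr.fit(data_train, target_train)
print(f"Trees actually fitted: {hgbr.n_iter_}")
print(f"Test R2: {hgbr.score(data_test, target_test):.3f}")
```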
@@ -156,8 +197,8 @@
 # on the other hyperparameter values.
 
 # %% [markdown]
-# Now we estimate the generalization performance of the best model
-# using the test set.
+# Now we estimate the generalization performance of the best model using the
+# test set.
 
 # %%
 error = -search_cv.score(data_test, target_test)
@@ -166,5 +207,5 @@
 # %% [markdown]
 # The mean test score in the held-out test set is slightly better than the score
 # of the best model. The reason is that the final model is refitted on the whole
-# training set and therefore, on more data than the inner cross-validated models
-# of the grid search procedure.
+# training set and therefore, on more data than the cross-validated models of
+# the grid search procedure.
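The same observation can be checked numerically. Reusing the fitted `search_cv` from the sketch after the random-forest hunk above (so this snippet is deliberately not self-contained):

```python
# best_score_ is the mean cross-validated score of the best candidate, computed
# on validation folds carved out of the training set only; the refit model is
# trained on the whole training set, hence the slightly better test error.
inner_cv_error = -search_cv.best_score_
test_error = -search_cv.score(data_test, target_test)
print(f"Mean CV error of the best candidate: {inner_cv_error:.3f}")
print(f"Error of the refit model on the test set: {test_error:.3f}")
```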
