Commit c393a00

ArturoAmor and ogrisel authored
Improve wording in definition of bootstrap resampling (INRIA#617)
Co-authored-by: Olivier Grisel <[email protected]>
1 parent 2ed9ec1 commit c393a00

2 files changed: 12 additions & 6 deletions

python_scripts/cross_validation_sol_01.py

Lines changed: 2 additions & 2 deletions
@@ -188,8 +188,8 @@
 # We observe that adding new samples to the training dataset does not seem to
 # improve the training and testing scores. In particular, the testing score
 # oscillates around 76% accuracy. Indeed, ~76% of the samples belong to the
-# class `"not donated"``. Notice then that a classifier that always predicts the
-# `"not donated"`` class would achieve an accuracy of 76% without using any
+# class `"not donated"`. Notice then that a classifier that always predicts the
+# `"not donated"` class would achieve an accuracy of 76% without using any
 # information from the data itself. This can mean that our small pipeline is not
 # able to use the input features to improve upon that simplistic baseline, and
 # increasing the training set size does not help either.
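
The baseline argument in the comment above can be checked with a few lines of Python. This is a minimal sketch, not code from the repository: the labels are made up to mirror the ~76% / 24% split mentioned in the text, the `"donated"` label name is an assumption, and `majority_class_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up labels with the same class balance as described in the comment
# (about 76% "not donated"); this is NOT the real dataset.
labels = ["not donated"] * 76 + ["donated"] * 24

majority_label, baseline_accuracy = majority_class_accuracy(labels)
print(majority_label, baseline_accuracy)
```

Any model whose test accuracy hovers around this value is not doing better than ignoring the input features altogether, which is the point the comment makes.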

python_scripts/ensemble_bagging.py

Lines changed: 10 additions & 4 deletions
@@ -83,10 +83,16 @@ def generate_data(n_samples=30):
 #
 # ## Bootstrap resampling
 #
-# A bootstrap sample corresponds to a resampling with replacement, of the
-# original dataset, a sample that is the same size as the original dataset.
-# Thus, the bootstrap sample will contain some data points several times while
-# some of the original data points will not be present.
+# Bootstrapping is a resampling "with replacement" of the original
+# dataset. It corresponds to sampling n out of n data points with
+# replacement uniformly at random from the original dataset. n is the
+# number of data points in the original dataset.
+#
+# As a result, the output of the bootstrap sampling procedure is another
+# dataset with also n data points, but likely with duplicates. As a consequence,
+# there are also data points from the original dataset that are never selected to
+# appear in a bootstrap sample (by chance). Those data points that are left away
+# are often referred to as the out-of-bag sample.
 #
 # We will create a function that given `data` and `target` will return a
 # resampled variation `data_bootstrap` and `target_bootstrap`.
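
The procedure described in the new wording (draw n out of n points uniformly at random with replacement, and call the never-selected points the out-of-bag sample) can be sketched as follows. This is an illustration under stated assumptions, not the function defined in `ensemble_bagging.py`: `bootstrap_indices` is a hypothetical helper, the toy `data` and `target` values are made up, and only the standard library is used so that the snippet runs on its own.

```python
import random


def bootstrap_indices(n, seed=None):
    """Draw n indices from range(n) uniformly at random, with replacement."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    # Indices that were never drawn form the out-of-bag sample.
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


# Toy stand-ins for `data` and `target` (made-up values).
data = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1]
target = [1.0, 1.4, 2.2, 2.9, 3.1, 4.0]

in_bag, out_of_bag = bootstrap_indices(len(data), seed=0)

# Apply the same indices to both sequences so that pairs stay aligned.
data_bootstrap = [data[i] for i in in_bag]
target_bootstrap = [target[i] for i in in_bag]

print("in-bag indices:", in_bag)
print("out-of-bag indices:", out_of_bag)
```

The bootstrap sample always has the same length as the original dataset. Whether a given run contains duplicates or leaves points out of bag depends on the random draw.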
