Improve wording in definition of bootstrap resampling (INRIA#617)

ArturoAmorQ · ogrisel · web-flow · commit c393a00a95ff · 2022-04-05T18:06:31.000+02:00
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
diff --git a/python_scripts/cross_validation_sol_01.py b/python_scripts/cross_validation_sol_01.py
@@ -188,8 +188,8 @@
 # We observe that adding new samples to the training dataset does not seem to
 # improve the training and testing scores. In particular, the testing score
 # oscillates around 76% accuracy. Indeed, ~76% of the samples belong to the
-# class `"not donated"``. Notice then that a classifier that always predicts the
-# `"not donated"`` class would achieve an accuracy of 76% without using any
+# class `"not donated"`. Notice then that a classifier that always predicts the
+# `"not donated"` class would achieve an accuracy of 76% without using any
 # information from the data itself. This can mean that our small pipeline is not
 # able to use the input features to improve upon that simplistic baseline, and
 # increasing the training set size does not help either.
diff --git a/python_scripts/ensemble_bagging.py b/python_scripts/ensemble_bagging.py
@@ -83,10 +83,16 @@ def generate_data(n_samples=30):
 #
 # ## Bootstrap resampling
 #
-# A bootstrap sample corresponds to a resampling with replacement, of the
-# original dataset, a sample that is the same size as the original dataset.
-# Thus, the bootstrap sample will contain some data points several times while
-# some of the original data points will not be present.
+# Bootstrapping is a resampling "with replacement" of the original
+# dataset. It corresponds to drawing n data points uniformly at random,
+# with replacement, from the original dataset, where n is the number of
+# data points in the original dataset.
+#
+# As a result, the output of the bootstrap sampling procedure is another
+# dataset that also contains n data points, but likely with duplicates. As a
+# consequence, some data points from the original dataset are never selected
+# to appear in a given bootstrap sample (by chance). Those data points that
+# are left out are often referred to as the out-of-bag sample.
 #
 # We will create a function that given `data` and `target` will return a
 # resampled variation `data_bootstrap` and `target_bootstrap`.
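
Note on the reworded definition: the sketch below illustrates sampling n out of n data points with replacement and recovering the out-of-bag indices. It is not the function from `ensemble_bagging.py`; the name `bootstrap_sample_sketch`, the `seed` parameter, and the use of NumPy arrays for `data` and `target` are assumptions made for illustration only.

```python
import numpy as np


def bootstrap_sample_sketch(data, target, seed=None):
    """Hypothetical helper: draw one bootstrap sample from NumPy arrays."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Draw n row indices uniformly at random, with replacement.
    indices = rng.integers(0, n, size=n)
    # Rows that were never drawn form the out-of-bag sample.
    out_of_bag_mask = np.ones(n, dtype=bool)
    out_of_bag_mask[indices] = False
    out_of_bag_indices = np.flatnonzero(out_of_bag_mask)
    return data[indices], target[indices], out_of_bag_indices


# Toy data (assumed shapes): 30 rows, one feature.
data = np.arange(30).reshape(-1, 1)
target = 2 * np.arange(30)
data_bootstrap, target_bootstrap, out_of_bag_indices = bootstrap_sample_sketch(
    data, target, seed=0
)
print(f"unique rows drawn: {np.unique(data_bootstrap).size} out of {len(data)}")
print(f"out-of-bag rows: {out_of_bag_indices.size}")
```

The bootstrap sample has the same number of rows as the original data, and the unique rows drawn plus the out-of-bag rows always add up to n. For large n, the probability that a given point is left out of one bootstrap sample is (1 - 1/n)^n, which approaches 1/e, or roughly 37%.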