2 files changed: +12 -6 lines changed

@@ -188,8 +188,8 @@
 # We observe that adding new samples to the training dataset does not seem to
 # improve the training and testing scores. In particular, the testing score
 # oscillates around 76% accuracy. Indeed, ~76% of the samples belong to the
-# class `"not donated"`` . Notice then that a classifier that always predicts the
-# `"not donated"`` class would achieve an accuracy of 76% without using any
+# class `"not donated"`. Notice then that a classifier that always predicts the
+# `"not donated"` class would achieve an accuracy of 76% without using any
 # information from the data itself. This can mean that our small pipeline is not
 # able to use the input features to improve upon that simplistic baseline, and
 # increasing the training set size does not help either.
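
Review note: the baseline argument in this comment can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper `majority_baseline_accuracy` and the 100-label toy list are assumptions made up for this note, not code or data from the notebook. It only mirrors the roughly 76% / 24% class split the comment describes.

```python
from collections import Counter


def majority_baseline_accuracy(target):
    """Return the most frequent class and the accuracy obtained by
    always predicting it (hypothetical helper, not from the notebook)."""
    most_common_class, count = Counter(target).most_common(1)[0]
    return most_common_class, count / len(target)


# Toy labels (assumed) reproducing the ~76% / 24% split from the comment.
target = ["not donated"] * 76 + ["donated"] * 24
label, accuracy = majority_baseline_accuracy(target)
print(label, accuracy)
```

Any model whose test accuracy stays near this value is not doing better than ignoring the features, which is the point the comment makes.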
@@ -83,10 +83,16 @@ def generate_data(n_samples=30):
 #
 # ## Bootstrap resampling
 #
-# A bootstrap sample corresponds to a resampling with replacement, of the
-# original dataset, a sample that is the same size as the original dataset.
-# Thus, the bootstrap sample will contain some data points several times while
-# some of the original data points will not be present.
+# Bootstrapping is a resampling "with replacement" of the original
+# dataset. It corresponds to sampling n out of n data points with
+# replacement uniformly at random from the original dataset. n is the
+# number of data points in the original dataset.
+#
+# As a result, the output of the bootstrap sampling procedure is another
+# dataset with also n data points, but likely with duplicates. As a consequence,
+# there are also data points from the original dataset that are never selected to
+# appear in a bootstrap sample (by chance). Those data points that are left away
+# are often referred to as the out-of-bag sample.
 #
 # We will create a function that given `data` and `target` will return a
 # resampled variation `data_bootstrap` and `target_bootstrap`.
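
Review note: for readers of this hunk, here is a minimal sketch of the procedure the new comment describes, written with the standard library only. The function name `bootstrap_sample`, the `seed` parameter, the extra `out_of_bag` return value, and the toy `data` / `target` lists are all assumptions for illustration; the notebook's own function may be written differently.

```python
import random


def bootstrap_sample(data, target, seed=None):
    """Draw n out of n points with replacement (hypothetical helper).

    Returns the resampled data, the resampled target, and the sorted
    indices of the original points that were never drawn (out-of-bag).
    """
    rng = random.Random(seed)
    n = len(data)
    # Sample n indices uniformly at random, with replacement.
    indices = rng.choices(range(n), k=n)
    data_bootstrap = [data[i] for i in indices]
    target_bootstrap = [target[i] for i in indices]
    # Original indices that never appear form the out-of-bag sample.
    out_of_bag = sorted(set(range(n)) - set(indices))
    return data_bootstrap, target_bootstrap, out_of_bag


# Toy dataset (assumed): 30 points, matching the n_samples=30 default above.
data = [[float(i)] for i in range(30)]
target = [2.0 * i for i in range(30)]
data_bootstrap, target_bootstrap, out_of_bag = bootstrap_sample(
    data, target, seed=0
)
print(len(data_bootstrap), len(out_of_bag))
```

Each original point is missed by a single bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows, so roughly a third of the points are expected to be out-of-bag.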