# notebook. The target to be predicted is a continuous variable and no longer
# discrete. This task is called regression.
#
# Thus, we will use a predictive model specific to regression and not to
# classification.

# %%
# record their statistical performance on each variant of the test set.
#
# To evaluate the statistical performance of our regressor, we can use
# [`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)
# with a
# [`sklearn.model_selection.ShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)
# object:

# %%
from sklearn.model_selection import cross_validate
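As a self-contained sketch of what such a call can look like (the notebook's own cell uses the California housing data and its regressor; here a synthetic dataset and `LinearRegression` are stand-ins, and `n_splits=40` / `test_size=0.3` are illustrative choices):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# stand-in data and model for illustration only
data, target = make_regression(n_samples=200, n_features=5, random_state=0)
regressor = LinearRegression()

# 40 random splits, each keeping 30% of the samples for testing
cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)
cv_results = pd.DataFrame(cross_validate(regressor, data, target, cv=cv))
print(cv_results[["fit_time", "score_time", "test_score"]].head())
```

Each row of the resulting dataframe corresponds to one train/test split drawn by `ShuffleSplit`.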
cv_results.head(10)

# %% [markdown]
# We get timing information to fit and predict at each cross-validation
# iteration. Also, we get the test score, which corresponds to the testing
# error on each of the splits.

# %%
len(cv_results)
# 46.36 +/- 1.17 k\$.
#
# If we were to train a single model on the full dataset (without
# cross-validation) and then later had access to an unlimited amount of test
# data, we would expect its true testing error to fall close to that
# region.
#
#
# We notice that the mean estimate of the testing error obtained by
# cross-validation is a bit smaller than the natural scale of variation of the
# target variable. Furthermore, the standard deviation of the cross-validation
# estimate of the testing error is even smaller.
#
# This is a good start, but not necessarily enough to decide whether the
# mean absolute percentage error would have been a much better choice.
#
# But in all cases, an error of 47 k\$ might be too large to automatically use
# our model to tag house values without expert supervision.
#
# ## More detail regarding `cross_validate`
#
# During cross-validation, many models are trained and evaluated. Indeed, the
# number of elements in each array of the output of `cross_validate` is a
# result from one of these `fit`/`score` procedures. To make it explicit, it is
# possible to retrieve these fitted models for each of the splits/folds by
# passing the option `return_estimator=True` to `cross_validate`.

# %%
cv_results = cross_validate(regressor, data, target, return_estimator=True)
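For instance, the fitted models are exposed under the `"estimator"` key of the returned dict. A minimal self-contained sketch (with a synthetic dataset and `LinearRegression` as stand-ins for the notebook's data and regressor):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# stand-in data and model for illustration only
data, target = make_regression(n_samples=200, n_features=3, random_state=0)
cv_results = cross_validate(
    LinearRegression(), data, target, return_estimator=True
)

# with the default 5-fold cross-validation, we get 5 fitted models
for fold_idx, model in enumerate(cv_results["estimator"]):
    print(f"fold #{fold_idx}: coef_ = {model.coef_}")
```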
# because it allows us to inspect the internal fitted parameters of these
# regressors.
#
# In the case where you are only interested in the test score, scikit-learn
# provides a `cross_val_score` function. It is identical to calling the
# `cross_validate` function and selecting the `test_score` only (as we
# extensively did in the previous notebooks).
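A sketch of that equivalence, again with a synthetic stand-in dataset and regressor (the comparison is deterministic here because the default `KFold` splitter does not shuffle):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate

# stand-in data and model for illustration only
data, target = make_regression(n_samples=200, n_features=5, random_state=0)
regressor = LinearRegression()

scores = cross_val_score(regressor, data, target)
test_scores = cross_validate(regressor, data, target)["test_score"]

# cross_val_score returns exactly the "test_score" entry of cross_validate
print(np.allclose(scores, test_scores))
```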