Commit b3427e7

Update notebooks (INRIA#616)
1 parent f31710b commit b3427e7

File tree

3 files changed: +40 −44 lines

notebooks/cross_validation_grouping.ipynb

Lines changed: 19 additions & 19 deletions
@@ -26,8 +26,8 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "We will recreate the same model used in the previous exercise:\n",
- "a logistic regression classifier with preprocessor to scale the data."
+ "We will recreate the same model used in the previous notebook:\n",
+ "a logistic regression classifier with a preprocessor to scale the data."
  ]
 },
 {
@@ -138,9 +138,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "The cross-validation testing error that uses the shuffling has less\n",
- "variance than the one that does not impose any shuffling. It means that some\n",
- "specific fold leads to a low score in this case."
+ "The cross-validation testing error that uses the shuffling has less variance\n",
+ "than the one that does not impose any shuffling. It means that some specific\n",
+ "fold leads to a low score in this case."
  ]
 },
 {
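The variance claim in this hunk can be checked with a short sketch (illustrative, not part of this commit; the notebook's exact cross-validation code may differ — the `KFold` setup here is an assumption) comparing shuffled and unshuffled folds on the digits dataset with the scaler-plus-logistic-regression model the notebook describes:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
# The model described in the diff: a scaler followed by logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

stds = {}
for shuffle in (True, False):
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    scores = cross_val_score(model, X, y, cv=cv)
    stds[shuffle] = scores.std()
    print(f"shuffle={shuffle}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Shuffling typically yields a lower fold-to-fold standard deviation here, because without shuffling some folds align with the writer structure the notebook goes on to discuss.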
@@ -176,11 +176,11 @@
  "source": [
  "If we read carefully, 13 writers wrote the digits of our dataset, accounting\n",
  "for a total amount of 1797 samples. Thus, a writer wrote several times the\n",
- "same numbers. Let's suppose that the writer samples are grouped.\n",
- "Subsequently, not shuffling the data will keep all writer samples together\n",
- "either in the training or the testing sets. Mixing the data will break this\n",
- "structure, and therefore digits written by the same writer will be available\n",
- "in both the training and testing sets.\n",
+ "same numbers. Let's suppose that the writer samples are grouped. Subsequently,\n",
+ "not shuffling the data will keep all writer samples together either in the\n",
+ "training or the testing sets. Mixing the data will break this structure, and\n",
+ "therefore digits written by the same writer will be available in both the\n",
+ "training and testing sets.\n",
  "\n",
  "Besides, a writer will usually tend to write digits in the same manner. Thus,\n",
  "our model will learn to identify a writer's pattern for each digit instead of\n",
@@ -209,14 +209,14 @@
  "\n",
  "It might not be obvious at first, but there is a structure in the target:\n",
  "there is a repetitive pattern that always starts by some series of ordered\n",
- "digits from 0 to 9 followed by random digits at a certain point. If we look\n",
- "in details, we see that there is 14 such patterns, always with around 130\n",
- "samples each.\n",
+ "digits from 0 to 9 followed by random digits at a certain point. If we look in\n",
+ "detail, we see that there are 14 such patterns, always with around 130 samples\n",
+ "each.\n",
  "\n",
- "Even if it is not exactly corresponding to the 13 writers in the\n",
- "documentation (maybe one writer wrote two series of digits), we can\n",
- "make the hypothesis that each of these patterns corresponds to a different\n",
- "writer and thus a different group."
+ "Even if it is not exactly corresponding to the 13 writers in the documentation\n",
+ "(maybe one writer wrote two series of digits), we can make the hypothesis that\n",
+ "each of these patterns corresponds to a different writer and thus a different\n",
+ "group."
  ]
 },
 {
@@ -326,8 +326,8 @@
  "metadata": {},
  "source": [
  "As a conclusion, it is really important to take any sample grouping pattern\n",
- "into account when evaluating a model. Otherwise, the results obtained will\n",
- "be over-optimistic in regards with reality."
+ "into account when evaluating a model. Otherwise, the results obtained will be\n",
+ "over-optimistic in regards with reality."
  ]
 }
],
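The grouping idea this notebook describes can be sketched as follows (illustrative, not part of the commit): approximate each writer as a block of roughly 130 consecutive samples — the block size is an assumption taken from the "14 patterns of around 130 samples" observation — and hand the labels to `GroupKFold` so no writer appears in both training and testing folds:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
# Hypothetical writer labels: blocks of ~130 consecutive samples,
# approximating the 14 repetitive patterns described in the notebook.
groups = np.arange(len(X)) // 130

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print(f"grouped CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Group-aware scores are usually lower than shuffled ones, which is exactly the over-optimism the conclusion hunk warns about.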

notebooks/cross_validation_time.ipynb

Lines changed: 19 additions & 23 deletions
@@ -20,8 +20,8 @@
  "depends on past information.\n",
  "\n",
  "We will take an example to highlight such issues with non-i.i.d. data in the\n",
- "previous cross-validation strategies presented. We are going to load\n",
- "financial quotations from some energy companies."
+ "previous cross-validation strategies presented. We are going to load financial\n",
+ "quotations from some energy companies."
  ]
 },
 {
@@ -70,14 +70,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "We will repeat the experiment asked during the exercise. Instead of using\n",
- "random data, we will use real quotations this time. While it was obvious that\n",
- "a predictive model could not work in practice on random data, this is the\n",
- "same on these real data. So here, we want to predict the quotation of Chevron\n",
- "using all other energy companies' quotes.\n",
- "\n",
- "To make explanatory plots, we will use a single split in addition to the\n",
- "cross-validation that you used in the introductory exercise."
+ "Here, we want to predict the quotation of Chevron using all other energy\n",
+ "companies' quotes. To make explanatory plots, we first use a train-test split\n",
+ "and then we evaluate other cross-validation methods."
  ]
 },
 {
@@ -158,9 +153,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Surprisingly, we get outstanding generalization performance. We will investigate\n",
- "and find the reason for such good results with a model that is expected to\n",
- "fail. We previously mentioned that `ShuffleSplit` is an iterative\n",
+ "Surprisingly, we get outstanding generalization performance. We will\n",
+ "investigate and find the reason for such good results with a model that is\n",
+ "expected to fail. We previously mentioned that `ShuffleSplit` is an iterative\n",
  "cross-validation scheme that shuffles data and split. We will simplify this\n",
  "procedure with a single split and plot the prediction. We can use\n",
  "`train_test_split` for this purpose."
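The leakage mechanism this notebook investigates can be reproduced on synthetic data (a hypothetical sine series standing in for the quotations, not part of the commit): shuffling before splitting lets a model score well by interpolating memorized neighbours, while a chronological split exposes it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical smooth series standing in for the financial quotations.
rng = np.random.default_rng(0)
t = np.arange(400)
y = np.sin(t / 20) + 0.01 * rng.standard_normal(t.size)
X = t.reshape(-1, 1)

def r2_score_with(shuffle):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, shuffle=shuffle, random_state=0)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)

# Shuffled: test points sit between memorized training neighbours.
print(f"shuffled split R2:      {r2_score_with(True):.2f}")
# Chronological: the model must extrapolate into the unseen future.
print(f"chronological split R2: {r2_score_with(False):.2f}")
```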
@@ -228,10 +223,10 @@
  "testing. But we can also see that the testing samples are next to some\n",
  "training sample. And with these time-series, we see a relationship between a\n",
  "sample at the time `t` and a sample at `t+1`. In this case, we are violating\n",
- "the i.i.d. assumption. The insight to get is the following: a model can\n",
- "output of its training set at the time `t` for a testing sample at the time\n",
- "`t+1`. This prediction would be close to the true value even if our model\n",
- "did not learn anything, but just memorized the training dataset.\n",
+ "the i.i.d. assumption. The insight to get is the following: a model can output\n",
+ "of its training set at the time `t` for a testing sample at the time `t+1`.\n",
+ "This prediction would be close to the true value even if our model did not\n",
+ "learn anything, but just memorized the training dataset.\n",
  "\n",
  "An easy way to verify this hypothesis is to not shuffle the data when doing\n",
  "the split. In this case, we will use the first 75% of the data to train and\n",
@@ -292,7 +287,8 @@
  "source": [
  "We see that our model cannot predict anything because it doesn't have samples\n",
  "around the testing sample. Let's check how we could have made a proper\n",
- "cross-validation scheme to get a reasonable generalization performance estimate.\n",
+ "cross-validation scheme to get a reasonable generalization performance\n",
+ "estimate.\n",
  "\n",
  "One solution would be to group the samples into time blocks, e.g. by quarter,\n",
  "and predict each group's information by using information from the other\n",
@@ -324,8 +320,8 @@
  "\n",
  "Another thing to consider is the actual application of our solution. If our\n",
  "model is aimed at forecasting (i.e., predicting future data from past data),\n",
- "we should not use training data that are ulterior to the testing data. In\n",
- "this case, we can use the `TimeSeriesSplit` cross-validation to enforce this\n",
+ "we should not use training data that are ulterior to the testing data. In this\n",
+ "case, we can use the `TimeSeriesSplit` cross-validation to enforce this\n",
  "behaviour."
  ]
 },
@@ -349,9 +345,9 @@
  "metadata": {},
  "source": [
  "In conclusion, it is really important to not use an out of the shelves\n",
- "cross-validation strategy which do not respect some assumptions such as\n",
- "having i.i.d data. It might lead to absurd results which could make think\n",
- "that a predictive model might work."
+ "cross-validation strategy which do not respect some assumptions such as having\n",
+ "i.i.d data. It might lead to absurd results which could make think that a\n",
+ "predictive model might work."
  ]
 }
],
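The `TimeSeriesSplit` behaviour the notebook relies on can be sketched in a few lines (illustrative ordered data, not the notebook's quotations): every training fold strictly precedes its testing fold, so no "future" data leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A hypothetical ordered series; each split trains only on the past.
X = np.arange(100).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in splits:
    # Every training index precedes every testing index.
    assert train_idx.max() < test_idx.min()
    print(f"train up to {train_idx.max()}, test {test_idx.min()}..{test_idx.max()}")
```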

notebooks/parameter_tuning_nested.ipynb

Lines changed: 2 additions & 2 deletions
@@ -245,7 +245,7 @@
  "hyper-parameters and to train the refitted model.\n",
  "\n",
  "Because of the above, one must keep an external, held-out test set for the\n",
- "final evaluation the refitted model. We highlight here the process using a\n",
+ "final evaluation of the refitted model. We highlight here the process using a\n",
  "single train-test split."
  ]
 },
@@ -316,7 +316,7 @@
  "refitted tuned model.\n",
  "\n",
  "In practice, we only need to embed the grid-search in the function\n",
- "`cross-validate` to perform such evaluation."
+ "`cross_validate` to perform such evaluation."
  ]
 },
 {
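The embedding this fix refers to can be sketched as follows (an illustrative model and parameter grid, not the notebook's exact code): the `GridSearchCV` estimator itself is passed to `cross_validate`, so hyper-parameter tuning happens inside each outer fold and the outer scores evaluate the refitted tuned model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Inner loop: grid-search selects C on each outer training fold.
inner_cv_model = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=3)
# Outer loop: each outer test fold scores the refitted tuned model.
results = cross_validate(inner_cv_model, X, y, cv=5)
print(f"nested CV accuracy: {results['test_score'].mean():.3f}")
```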

0 commit comments