|
20 | 20 | "depends on past information.\n",
|
21 | 21 | "\n",
|
22 | 22 | "We will take an example to highlight such issues with non-i.i.d. data in the\n",
|
23 |
| - "previous cross-validation strategies presented. We are going to load\n", |
24 |
| - "financial quotations from some energy companies." |
| 23 | + "previous cross-validation strategies presented. We are going to load financial\n", |
| 24 | + "quotations from some energy companies." |
25 | 25 | ]
|
26 | 26 | },
|
27 | 27 | {
|
|
70 | 70 | "cell_type": "markdown",
|
71 | 71 | "metadata": {},
|
72 | 72 | "source": [
|
73 |
| - "We will repeat the experiment asked during the exercise. Instead of using\n", |
74 |
| - "random data, we will use real quotations this time. While it was obvious that\n", |
75 |
| - "a predictive model could not work in practice on random data, this is the\n", |
76 |
| - "same on these real data. So here, we want to predict the quotation of Chevron\n", |
77 |
| - "using all other energy companies' quotes.\n", |
78 |
| - "\n", |
79 |
| - "To make explanatory plots, we will use a single split in addition to the\n", |
80 |
| - "cross-validation that you used in the introductory exercise." |
| 73 | + "Here, we want to predict the quotation of Chevron using all other energy\n", |
| 74 | + "companies' quotes. To make explanatory plots, we first use a train-test split\n", |
| 75 | + "and then we evaluate other cross-validation methods." |
81 | 76 | ]
|
82 | 77 | },
|
83 | 78 | {
|
|
158 | 153 | "cell_type": "markdown",
|
159 | 154 | "metadata": {},
|
160 | 155 | "source": [
|
161 |
| - "Surprisingly, we get outstanding generalization performance. We will investigate\n", |
162 |
| - "and find the reason for such good results with a model that is expected to\n", |
163 |
| - "fail. We previously mentioned that `ShuffleSplit` is an iterative\n", |
| 156 | + "Surprisingly, we get outstanding generalization performance. We will\n", |
| 157 | + "investigate and find the reason for such good results with a model that is\n", |
| 158 | + "expected to fail. We previously mentioned that `ShuffleSplit` is an iterative\n", |
164 | 159 | "cross-validation scheme that shuffles the data and splits it. We will simplify this\n",
|
165 | 160 | "procedure with a single split and plot the prediction. We can use\n",
|
166 | 161 | "`train_test_split` for this purpose."
|
|
228 | 223 | "testing. But we can also see that the testing samples are next to some\n",
|
229 | 224 | "training samples. And with these time series, we see a relationship between a\n",
|
230 | 225 | "sample at the time `t` and a sample at `t+1`. In this case, we are violating\n",
|
231 |
| - "the i.i.d. assumption. The insight to get is the following: a model can\n", |
232 |
| - "output of its training set at the time `t` for a testing sample at the time\n", |
233 |
| - "`t+1`. This prediction would be close to the true value even if our model\n", |
234 |
| - "did not learn anything, but just memorized the training dataset.\n", |
| 226 | + "the i.i.d. assumption. The insight to get is the following: a model can output\n", |
| 227 | + "the value it saw in its training set at time `t` for a testing sample at time\n", |
| 228 | + "`t+1`. This prediction would be close to the true value even if our model did\n", |
| 229 | + "not learn anything, but just memorized the training dataset.\n", |
235 | 230 | "\n",
|
236 | 231 | "An easy way to verify this hypothesis is to not shuffle the data when doing\n",
|
237 | 232 | "the split. In this case, we will use the first 75% of the data to train and\n",
|
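The contrast described in this hunk (a shuffled split versus an ordered one) can be reproduced on synthetic data. The following is a minimal sketch, not the notebook's own code: the seeded random walk stands in for a quotation series, the time index is assumed to be the only feature, and the helper `single_split_r2` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for a quotation series: a seeded random walk.
rng = np.random.default_rng(0)
target = np.cumsum(rng.normal(size=1000))
# Assumption: the only feature is the time index, so the model has nothing
# to learn from and can only memorize the training targets.
data = np.arange(target.size).reshape(-1, 1)


def single_split_r2(shuffle):
    """Fit a fully grown tree on 75% of the data, return R2 on the other 25%."""
    data_train, data_test, target_train, target_test = train_test_split(
        data,
        target,
        test_size=0.25,
        shuffle=shuffle,
        random_state=0 if shuffle else None,
    )
    model = DecisionTreeRegressor(random_state=0)
    model.fit(data_train, target_train)
    return model.score(data_test, target_test)


r2_shuffled = single_split_r2(shuffle=True)
r2_ordered = single_split_r2(shuffle=False)
print(f"R2 with shuffling:    {r2_shuffled:.2f}")
print(f"R2 without shuffling: {r2_ordered:.2f}")
```

With shuffling, each testing sample has training neighbours one or two time steps away, so the memorized values score well. Without shuffling, every testing sample lies after the training range and the tree returns a single constant, which cannot beat the mean of the testing targets.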
|
292 | 287 | "source": [
|
293 | 288 | "We see that our model cannot predict anything because it doesn't have samples\n",
|
294 | 289 | "around the testing sample. Let's check how we could have made a proper\n",
|
295 |
| - "cross-validation scheme to get a reasonable generalization performance estimate.\n", |
| 290 | + "cross-validation scheme to get a reasonable generalization performance\n", |
| 291 | + "estimate.\n", |
296 | 292 | "\n",
|
297 | 293 | "One solution would be to group the samples into time blocks, e.g. by quarter,\n",
|
298 | 294 | "and predict each group's information by using information from the other\n",
|
|
324 | 320 | "\n",
|
325 | 321 | "Another thing to consider is the actual application of our solution. If our\n",
|
326 | 322 | "model is aimed at forecasting (i.e., predicting future data from past data),\n",
|
327 |
| - "we should not use training data that are ulterior to the testing data. In\n", |
328 |
| - "this case, we can use the `TimeSeriesSplit` cross-validation to enforce this\n", |
| 323 | + "we should not use training data that come after the testing data. In this\n", |
| 324 | + "case, we can use the `TimeSeriesSplit` cross-validation to enforce this\n", |
329 | 325 | "behaviour."
|
330 | 326 | ]
|
331 | 327 | },
|
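To see the ordering that `TimeSeriesSplit` enforces, one can print the indices it yields. This is a small sketch under stated assumptions: the eight samples are hypothetical and assumed to be already sorted chronologically, and `n_splits=3` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: eight samples assumed to be in chronological order.
samples = np.arange(8).reshape(-1, 1)

cv = TimeSeriesSplit(n_splits=3)
# Collect the index arrays as plain lists so they are easy to inspect.
splits = [(train.tolist(), test.tolist()) for train, test in cv.split(samples)]

for train, test in splits:
    print(f"train={train} test={test}")
```

In every fold the training indices all precede the testing indices, and the training window grows from one fold to the next, which matches the forecasting setting described above.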
|
349 | 345 | "metadata": {},
|
350 | 346 | "source": [
|
351 | 347 | "In conclusion, it is really important not to use an off-the-shelf\n",
|
352 |
| - "cross-validation strategy which do not respect some assumptions such as\n", |
353 |
| - "having i.i.d data. It might lead to absurd results which could make think\n", |
354 |
| - "that a predictive model might work." |
| 348 | + "cross-validation strategy that does not respect its assumptions, such as having\n", |
| 349 | + "i.i.d. data. Doing so might lead to absurd results that make one think that a\n", |
| 350 | + "predictive model works." |
355 | 351 | ]
|
356 | 352 | }
|
357 | 353 | ],
|
|