|
17 | 17 | "<div class=\"admonition caution alert alert-warning\">\n",
|
18 | 18 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
|
19 | 19 | "<p class=\"last\">For the sake of clarity, no cross-validation will be used to estimate the\n",
|
20 |  | - "testing error. We are only showing the effect of the parameters\n", |
21 |  | - "on the validation set of what should be the inner cross-validation.</p>\n", |
| 20 | + "variability of the testing error. We are only showing the effect of the\n", |
| 21 | + "parameters on the validation set of what should be the inner loop of a nested\n", |
| 22 | + "cross-validation.</p>\n", |
22 | 23 | "</div>\n",
|
23 | 24 | "\n",
|
24 |  | - "## Random forest\n", |
25 |  | - "\n", |
26 |  | - "The main parameter to tune for random forest is the `n_estimators` parameter.\n", |
27 |  | - "In general, the more trees in the forest, the better the generalization\n", |
28 |  | - "performance will be. However, it will slow down the fitting and prediction\n", |
29 |  | - "time. The goal is to balance computing time and generalization performance when\n", |
30 |  | - "setting the number of estimators when putting such learner in production.\n", |
31 |  | - "\n", |
32 |  | - "Then, we could also tune a parameter that controls the depth of each tree in\n", |
33 |  | - "the forest. Two parameters are important for this: `max_depth` and\n", |
34 |  | - "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
35 |  | - "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
36 |  | - "`max_leaf_nodes` does not impose such constraint.\n", |
37 |  | - "\n", |
38 |  | - "Be aware that with random forest, trees are generally deep since we are\n", |
39 |  | - "seeking to overfit each tree on each bootstrap sample because this will be\n", |
40 |  | - "mitigated by combining them altogether. Assembling underfitted trees (i.e.\n", |
41 |  | - "shallow trees) might also lead to an underfitted forest." |
| 25 | + "We will start by loading the california housing dataset." |
42 | 26 | ]
|
43 | 27 | },
|
44 | 28 | {
|
|
56 | 40 | " data, target, random_state=0)"
|
57 | 41 | ]
|
58 | 42 | },
|
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "## Random forest\n", |
| 48 | + "\n", |
| 49 | + "The main parameter to select in random forest is the `n_estimators` parameter.\n", |
| 50 | + "In general, the more trees in the forest, the better the generalization\n", |
| 51 | + "performance will be. However, it will slow down the fitting and prediction\n", |
| 52 | + "time. The goal is to balance computing time and generalization performance\n", |
| 53 | + "when setting the number of estimators. Here, we fix `n_estimators=100`, which\n", |
| 54 | + "is already the default value.\n", |
| 55 | + "\n", |
| 56 | + "<div class=\"admonition caution alert alert-warning\">\n", |
| 57 | + "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n", |
| 58 | + "<p class=\"last\">Tuning the <tt class=\"docutils literal\">n_estimators</tt> for random forests generally result in a waste of\n", |
| 59 | + "computer power. We just need to ensure that it is large enough so that doubling\n", |
| 60 | + "its value does not lead to a significant improvement of the validation error.</p>\n", |
| 61 | + "</div>\n", |
| 62 | + "\n", |
| 63 | + "Instead, we can tune the hyperparameter `max_features`, which controls the\n", |
| 64 | + "size of the random subset of features to consider when looking for the best\n", |
| 65 | + "split when growing the trees: smaller values for `max_features` will lead to\n", |
| 66 | + "more random trees with hopefully more uncorrelated prediction errors. However\n", |
| 67 | + "if `max_features` is too small, predictions can be too random, even after\n", |
| 68 | + "averaging with the trees in the ensemble.\n", |
| 69 | + "\n", |
| 70 | + "If `max_features` is set to `None`, then this is equivalent to setting\n", |
| 71 | + "`max_features=n_features` which means that the only source of randomness in\n", |
| 72 | + "the random forest is the bagging procedure." |
| 73 | + ] |
| 74 | + }, |
| 75 | + { |
| 76 | + "cell_type": "code", |
| 77 | + "execution_count": null, |
| 78 | + "metadata": {}, |
| 79 | + "outputs": [], |
| 80 | + "source": [ |
| 81 | + "print(f\"In this case, n_features={len(data.columns)}\")" |
| 82 | + ] |
| 83 | + }, |
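
A quick aside (an addition, not part of the original notebook): the effect of `max_features` on the forest's randomness can be sketched with a small cross-validation. This is a hedged illustration; it reloads the California housing data so the snippet does not depend on variables defined elsewhere in the notebook.

```python
# Sketch: compare a very small `max_features` with the unrestricted default.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

data, target = fetch_california_housing(return_X_y=True, as_frame=True)

for max_features in (1, None):
    forest = RandomForestRegressor(
        n_estimators=100, max_features=max_features, n_jobs=2, random_state=0
    )
    # Default scoring for a regressor is the R2 coefficient of determination.
    scores = cross_val_score(forest, data, target, cv=3)
    print(f"max_features={max_features}: "
          f"R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```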
| 84 | + { |
| 85 | + "cell_type": "markdown", |
| 86 | + "metadata": {}, |
| 87 | + "source": [ |
| 88 | + "We can also tune the different parameters that control the depth of each tree\n", |
| 89 | + "in the forest. Two parameters are important for this: `max_depth` and\n", |
| 90 | + "`max_leaf_nodes`. They differ in the way they control the tree structure.\n", |
| 91 | + "Indeed, `max_depth` will enforce to have a more symmetric tree, while\n", |
| 92 | + "`max_leaf_nodes` does not impose such constraint. If `max_leaf_nodes=None`\n", |
| 93 | + "then the number of leaf nodes is unlimited.\n", |
| 94 | + "\n", |
| 95 | + "The hyperparameter `min_samples_leaf` controls the minimum number of samples\n", |
| 96 | + "required to be at a leaf node. This means that a split point (at any depth) is\n", |
| 97 | + "only done if it leaves at least `min_samples_leaf` training samples in each of\n", |
| 98 | + "the left and right branches. A small value for `min_samples_leaf` means that\n", |
| 99 | + "some samples can become isolated when a tree is deep, promoting overfitting. A\n", |
| 100 | + "large value would prevent deep trees, which can lead to underfitting.\n", |
| 101 | + "\n", |
| 102 | + "Be aware that with random forest, trees are expected to be deep since we are\n", |
| 103 | + "seeking to overfit each tree on each bootstrap sample. Overfitting is\n", |
| 104 | + "mitigated when combining the trees altogether, whereas assembling underfitted\n", |
| 105 | + "trees (i.e. shallow trees) might also lead to an underfitted forest." |
| 106 | + ] |
| 107 | + }, |
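
To make the depth-control discussion concrete, here is a hedged sketch (an addition, not the notebook's code) that grows single decision trees under different `max_leaf_nodes` constraints and reports the resulting depth.

```python
# Sketch: how `max_leaf_nodes` constrains the depth of a single tree.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

data, target = fetch_california_housing(return_X_y=True)

for max_leaf_nodes in (10, 100, None):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(data, target)
    print(f"max_leaf_nodes={max_leaf_nodes}: depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}")
```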
59 | 108 | {
|
60 | 109 | "cell_type": "code",
|
61 | 110 | "execution_count": null,
|
|
67 | 116 | "from sklearn.ensemble import RandomForestRegressor\n",
|
68 | 117 | "\n",
|
69 | 118 | "param_distributions = {\n",
|
70 |  | - " \"n_estimators\": [1, 2, 5, 10, 20, 50, 100, 200, 500],\n", |
71 |  | - " \"max_leaf_nodes\": [2, 5, 10, 20, 50, 100],\n", |
| 119 | + " \"max_features\": [1, 2, 3, 5, None],\n", |
| 120 | + " \"max_leaf_nodes\": [10, 100, 1000, None],\n", |
| 121 | + " \"min_samples_leaf\": [1, 2, 5, 10, 20, 50, 100],\n", |
72 | 122 | "}\n",
|
73 | 123 | "search_cv = RandomizedSearchCV(\n",
|
74 | 124 | " RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,\n",
|
|
88 | 138 | "cell_type": "markdown",
|
89 | 139 | "metadata": {},
|
90 | 140 | "source": [
|
91 |  | - "We can observe in our search that we are required to have a large\n", |
92 |  | - "number of leaves and thus deep trees. This parameter seems particularly\n", |
93 |  | - "impactful in comparison to the number of trees for this particular dataset:\n", |
94 |  | - "with at least 50 trees, the generalization performance will be driven by the\n", |
95 |  | - "number of leaves.\n", |
96 |  | - "\n", |
97 |  | - "Now we will estimate the generalization performance of the best model by\n", |
98 |  | - "refitting it with the full training set and using the test set for scoring on\n", |
99 |  | - "unseen data. This is done by default when calling the `.fit` method." |
| 141 | + "We can observe in our search that we are required to have a large number of\n", |
| 142 | + "`max_leaf_nodes` and thus deep trees. This parameter seems particularly\n", |
| 143 | + "impactful with respect to the other tuning parameters, but large values of\n", |
| 144 | + "`min_samples_leaf` seem to reduce the performance of the model.\n", |
| 145 | + "\n", |
| 146 | + "In practice, more iterations of random search would be necessary to precisely\n", |
| 147 | + "assert the role of each parameters. Using `n_iter=10` is good enough to\n", |
| 148 | + "quickly inspect the hyperparameter combinations that yield models that work\n", |
| 149 | + "well enough without spending too much computational resources. Feel free to\n", |
| 150 | + "try more interations on your own.\n", |
| 151 | + "\n", |
| 152 | + "Once the `RandomizedSearchCV` has found the best set of hyperparameters, it\n", |
| 153 | + "uses them to refit the model using the full training set. To estimate the\n", |
| 154 | + "generalization performance of the best model it suffices to call `.score` on\n", |
| 155 | + "the unseen data." |
100 | 156 | ]
|
101 | 157 | },
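
One way to back the claims above is to look at `cv_results_` directly. The following is a minimal sketch (assuming the fitted `search_cv` from the cell above); the column names are the standard ones produced by `RandomizedSearchCV`.

```python
# Sketch: rank the sampled hyperparameter combinations by mean test score.
import pandas as pd

cv_results = pd.DataFrame(search_cv.cv_results_)
columns = [c for c in cv_results.columns if c.startswith("param_")]
columns += ["mean_test_score", "std_test_score", "rank_test_score"]
print(cv_results[columns].sort_values("rank_test_score").head())
```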
|
102 | 158 | {
|
|
180 | 236 | "\n",
|
181 | 237 | "<div class=\"admonition caution alert alert-warning\">\n",
|
182 | 238 | "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
|
183 |  | - "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that using early-stopping as\n", |
184 |  | - "in the previous exercise will be better.</p>\n", |
| 239 | + "<p class=\"last\">Here, we tune the <tt class=\"docutils literal\">n_estimators</tt> but be aware that is better to use\n", |
| 240 | + "<tt class=\"docutils literal\">early_stopping</tt> as done in the Exercise M6.04.</p>\n", |
185 | 241 | "</div>\n",
|
186 | 242 | "\n",
|
187 | 243 | "In this search, we see that the `learning_rate` is required to be large\n",
|
|
196 | 252 | "cell_type": "markdown",
|
197 | 253 | "metadata": {},
|
198 | 254 | "source": [
|
199 |  | - "Now we estimate the generalization performance of the best model\n", |
200 |  | - "using the test set." |
| 255 | + "Now we estimate the generalization performance of the best model using the\n", |
| 256 | + "test set." |
201 | 257 | ]
|
202 | 258 | },
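
The caution box above refers to early stopping. As a hedged sketch of that alternative (the exercise's exact code is not shown here, so this is an assumption about the approach), one can let `HistGradientBoostingRegressor` choose the number of boosting iterations itself:

```python
# Sketch: early stopping instead of tuning the number of boosting iterations.
# Assumption: the data is reloaded here so the snippet is self-contained.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor

data, target = fetch_california_housing(return_X_y=True)

hgbr = HistGradientBoostingRegressor(
    max_iter=1000, early_stopping=True, validation_fraction=0.2,
    n_iter_no_change=5, random_state=0,
)
hgbr.fit(data, target)
print(f"Boosting iterations actually used: {hgbr.n_iter_}")
```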
|
203 | 259 | {
|
|
216 | 272 | "source": [
|
217 | 273 | "The mean test score in the held-out test set is slightly better than the score\n",
|
218 | 274 | "of the best model. The reason is that the final model is refitted on the whole\n",
|
219 |  | - "training set and therefore, on more data than the inner cross-validated models\n", |
220 |  | - "of the grid search procedure." |
| 275 | + "training set and therefore, on more data than the cross-validated models of\n", |
| 276 | + "the grid search procedure." |
221 | 277 | ]
|
222 | 278 | }
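
A small check of the refit behaviour described above (assuming the fitted `search_cv`): with the default `refit=True`, `best_estimator_` is the model retrained on the whole training set, and `.score` delegates to it.

```python
# Sketch: inspect the refitted best model behind `search_cv.score(...)`.
print(search_cv.best_params_)
print(search_cv.best_estimator_)
```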
|
223 | 279 | ],
|
|