Use train / test split in the classification introduction to avoid teaching bad habits
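
The "bad habit" in question is scoring a classifier on the same samples it was fitted on. The sketch below is not part of the notebook: it is a minimal, standard-library illustration using a hypothetical `Memorizer` class, a hypothetical `accuracy` helper, and synthetic random data (all assumptions made for the example). It shows how a model that merely stores its training samples can look perfect on the training data while doing poorly on held-out data.

```python
import random


class Memorizer:
    """Hypothetical toy classifier: stores training samples and looks them up verbatim."""

    def fit(self, X, y):
        # remember the label of every training sample exactly as seen
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        # samples never seen during fit fall back to a fixed guess (label 0)
        return [self.table.get(tuple(x), 0) for x in X]


def accuracy(predicted, expected):
    # fraction of predictions that match the expected labels
    matches = [p == e for p, e in zip(predicted, expected)]
    return sum(matches) / float(len(matches))


rng = random.Random(0)

# synthetic data (an assumption for this sketch): 200 samples, 4 features, labels 0-9
X = [[rng.random() for _ in range(4)] for _ in range(200)]
y = [rng.randrange(10) for _ in range(200)]

# hold out the last 50 samples; the data is already in random order
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

clf = Memorizer().fit(X_train, y_train)

train_accuracy = accuracy(clf.predict(X_train), y_train)
test_accuracy = accuracy(clf.predict(X_test), y_test)

print("accuracy on training data:", train_accuracy)
print("accuracy on held-out data:", test_accuracy)
```

Only the held-out score says anything about how the model handles unseen samples, which is why the notebook now fits on `X_train` and predicts on `X_test`.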

ogrisel · ogrisel · commit b1c1841d5e79 · 2013-06-24T15:25:48.000-05:00
diff --git a/notebooks/04A_supervised_classification.ipynb b/notebooks/04A_supervised_classification.ipynb
@@ -209,7 +209,7 @@
      "collapsed": false,
      "input": [
       "from sklearn.naive_bayes import GaussianNB\n",
-      "from sklearn import cross_validation"
+      "from sklearn.cross_validation import train_test_split"
      ],
      "language": "python",
      "metadata": {},
@@ -220,21 +220,27 @@
      "collapsed": false,
      "input": [
       "# split the data into training and validation sets\n",
-      "X = digits.data\n",
-      "y = digits.target\n",
+      "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)\n",
       "\n",
       "# train the model\n",
       "clf = GaussianNB()\n",
-      "clf.fit(X, y)\n",
+      "clf.fit(X_train, y_train)\n",
       "\n",
       "# use the model to predict the labels of the test data\n",
-      "predicted = clf.predict(X)\n",
-      "expected = y"
+      "predicted = clf.predict(X_test)\n",
+      "expected = y_test"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "**Question**: why did we split the data into training and validation sets?"
+     ]
+    },
     {
      "cell_type": "markdown",
      "metadata": {},
@@ -253,7 +259,8 @@
       "# plot the digits: each image is 8x8 pixels\n",
       "for i in range(64):\n",
       "    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n",
-      "    ax.imshow(digits.images[i], cmap=plt.cm.binary)\n",
+      "    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,\n",
+      "              interpolation='nearest')\n",
       "    \n",
       "    # label the image with the target value\n",
       "    if predicted[i] == expected[i]:\n",
@@ -265,13 +272,6 @@
      "metadata": {},
      "outputs": []
     },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "**Question: what might be a problem with judging performance based on these predictions?**"
-     ]
-    },
     {
      "cell_type": "heading",
      "level": 2,
@@ -301,6 +301,16 @@
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "matches.sum() / float(len(matches))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
     {
      "cell_type": "markdown",
      "metadata": {},
@@ -349,20 +359,6 @@
      "source": [
       "We see here that in particular, the numbers 1, 2, 3, and 9 are often being labeled 8."
      ]
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "As alluded to above, however, this is not a very good way to measure performance.\n",
-      "Why?  Because we are using the same data for **training** and **validation**.\n",
-      "With this metric, a classifier could be perfect by simply storing all the training\n",
-      "samples, and checking whether the \"unknown\" sample matches any exactly.  Things are\n",
-      "rarely this easy in real problems.\n",
-      "\n",
-      "In a later notebook, we'll learn how **validation sets** can be used\n",
-      "to get around this difficulty."
-     ]
     }
    ],
    "metadata": {}