add validation exercise

jakevdp · jakevdp · commit 8ee70f4bb6de · 2013-06-24T17:22:20.000-07:00
diff --git a/notebooks/06B_validation_exercise.ipynb b/notebooks/06B_validation_exercise.ipynb
@@ -25,13 +25,31 @@
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "This exercise covers cross-validation of regression models on the Diabetes\n",
+      "dataset.  The diabetes data consists of 10 physiological variables (age, sex, weight, blood pressure)\n",
+      "measured on 442 patients, and an indication of disease progression after one year:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "from sklearn.datasets import load_diabetes\n",
       "data = load_diabetes()\n",
-      "X, y = data.data, data.target\n",
+      "X, y = data.data, data.target"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
       "print X.shape"
      ],
      "language": "python",
@@ -42,97 +60,246 @@
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "from sklearn.linear_model import Ridge, Lasso\n",
+      "print y.shape"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Here we'll be fitting two regularized linear models,\n",
+      "*Ridge Regression*, which uses $\\ell_2$ regularization,\n",
+      "and *Lasso Regression*, which uses $\\ell_1$ regularization."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from sklearn.linear_model import Ridge, Lasso"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We'll first use the default hyper-parameters to see the baseline estimator.  We'll\n",
+      "use the cross-validation score to determine goodness-of-fit."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
       "from sklearn.cross_validation import cross_val_score\n",
       "\n",
-      "alphas = np.logspace(-4, 0, 20)\n",
-      "\n",
-      "# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
-      "# as a function of alpha.  Which is more difficult to tune?\n",
-      "\n"
+      "for Model in [Ridge, Lasso]:\n",
+      "    model = Model()\n",
+      "    print Model.__name__, cross_val_score(model, X, y).mean()"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We see that for the default hyper-parameter values, Lasso outperforms Ridge.\n",
+      "But is this the case for the *optimal* hyperparameters of each model?"
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Exercise: Basic Hyperparameter Optimization"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Here, spend some time writing a function which computes the cross-validation\n",
+      "score as a function of ``alpha``, the strength of the regularization for\n",
+      "``Lasso`` and ``Ridge``.  We'll choose 30 values of ``alpha`` between\n",
+      "0.001 and 0.1:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "clf = Lasso(alpha=1.0).fit(X, y)\n",
-      "clf.sparse_coef_.toarray()"
+      "alphas = np.logspace(-3, -1, 30)\n",
+      "\n",
+      "# plot the mean cross-validation score for a Ridge estimator and a Lasso estimator\n",
+      "# as a function of alpha.  Which is more difficult to tune?"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "heading",
+     "level": 3,
+     "metadata": {},
+     "source": [
+      "Solution"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV\n",
-      "from sklearn.cross_validation import cross_val_score"
+      "%load solutions/06B_basic_grid_search.py"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Automatically Performing Grid Search"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Because searching a grid of hyperparameters is such a common task, scikit-learn provides\n",
+      "several hyper-parameter estimators to automate this.  We'll explore this in more depth\n",
+      "later in the tutorial, but for now it is interesting to see how ``GridSearchCV`` works:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "alphas = 10 ** np.linspace(-3, -1, 20)\n",
-      "scores_L = [cross_val_score(Lasso(alpha), X, y)\n",
-      "            for alpha in alphas]\n",
-      "scores_R = [cross_val_score(Ridge(alpha), X, y)\n",
-      "            for alpha in alphas]\n",
-      "plt.plot(alphas, np.mean(scores_R, 1), label=\"Ridge\")\n",
-      "plt.plot(alphas, np.mean(scores_L, 1), label=\"Lasso\")\n",
-      "plt.xlabel('alpha')\n",
-      "plt.ylabel('score')\n",
-      "plt.legend()"
+      "from sklearn.grid_search import GridSearchCV"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "``GridSearchCV`` is constructed with an estimator, as well as a dictionary\n",
+      "of parameter values to be searched.  We can find the optimal parameters this\n",
+      "way:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "from sklearn.linear_model import LassoCV"
+      "for Model in [Ridge, Lasso]:\n",
+      "    gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)\n",
+      "    print Model.__name__, gscv.best_params_"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Built-in Hyperparameter Search"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "For some models within scikit-learn, cross-validation can be performed more efficiently\n",
+      "on large datasets.  In this case, a cross-validated version of the particular model is\n",
+      "included.  The cross-validated versions of ``Ridge`` and ``Lasso`` are ``RidgeCV`` and\n",
+      "``LassoCV``, respectively.  The grid search on these estimators can be performed as\n",
+      "follows:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "clf = LassoCV(alphas=alphas, cv=3).fit(X, y)\n",
-      "print \"Lasso:\", clf.alpha_\n",
-      "clf = RidgeCV(alphas=alphas, cv=3).fit(X, y)\n",
-      "print \"Ridge:\", clf.alpha_"
+      "from sklearn.linear_model import RidgeCV, LassoCV\n",
+      "for Model in [RidgeCV, LassoCV]:\n",
+      "    model = Model(alphas=alphas, cv=3).fit(X, y)\n",
+      "    print Model.__name__, model.alpha_"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "We see that the results match those returned by ``GridSearchCV``."
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 2,
+     "metadata": {},
+     "source": [
+      "Exercise: Learning Curves"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Here we'll apply our learning curves to the diabetes data.  The questions to answer are these:\n",
+      "\n",
+      "- Given the optimal models above, which is over-fitting and which is under-fitting the data?\n",
+      "- To obtain better results, would you invest time and effort in gathering\n",
+      "  more **training samples**, or gathering more **attributes** for each sample?\n",
+      "  Recall the previous discussion of reading learning curves.\n",
+      "\n",
+      "You can follow the process used in the previous notebook to plot the learning curves.\n",
+      "A good metric to use is the ``mean_squared_error``, which we'll import below:"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
-      "RidgeCV?"
+      "from sklearn.metrics import mean_squared_error\n",
+      "# define a function that computes the learning curve (i.e. mean_squared_error as a function\n",
+      "# of training set size, for both training and test sets) and plot the result\n"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },
+    {
+     "cell_type": "heading",
+     "level": 3,
+     "metadata": {},
+     "source": [
+      "Solution"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
-     "input": [],
+     "input": [
+      "%load solutions/06B_learning_curves.py"
+     ],
      "language": "python",
      "metadata": {},
      "outputs": []
diff --git a/notebooks/solutions/06B_basic_grid_search.py b/notebooks/solutions/06B_basic_grid_search.py
@@ -0,0 +1,5 @@
+for Model in [Lasso, Ridge]:
+    scores = [cross_val_score(Model(alpha), X, y, cv=3).mean()
+              for alpha in alphas]
+    plt.plot(alphas, scores, label=Model.__name__)
+plt.legend(loc='lower left')
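
The solution above depends on names that the notebook defines earlier (`X`, `y`, `alphas`, `np`, `plt`, `cross_val_score`). For readers who want to try the same alpha scan outside the notebook, here is a minimal standalone sketch. It assumes Python 3 and a recent scikit-learn, where `cross_val_score` lives in `sklearn.model_selection` rather than the `sklearn.cross_validation` module used in this commit. The helper name `mean_cv_scores` is hypothetical, and plotting is left out so that the block runs without a display.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score  # assumption: modern scikit-learn

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, -1, 30)  # same grid as the exercise cell


def mean_cv_scores(Model, alphas, X, y, cv=3):
    """Hypothetical helper: mean cross-validated score (R^2) for each alpha."""
    return np.array([
        cross_val_score(Model(alpha=alpha), X, y, cv=cv).mean()
        for alpha in alphas
    ])


# One score curve per model, keyed by class name
scores = {Model.__name__: mean_cv_scores(Model, alphas, X, y)
          for Model in (Lasso, Ridge)}

for name, curve in scores.items():
    best = np.argmax(curve)
    print(name, "best alpha:", alphas[best], "mean CV score:", curve[best])
```

Each array in `scores` can be passed to `plt.plot(alphas, ...)` to reproduce the figure from the solution file.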
diff --git a/notebooks/solutions/06B_learning_curves.py b/notebooks/solutions/06B_learning_curves.py
@@ -0,0 +1,40 @@
+from sklearn.metrics import explained_variance_score, mean_squared_error
+from sklearn.cross_validation import train_test_split
+
+def plot_learning_curve(model, err_func=explained_variance_score, N=300, n_runs=10, n_sizes=50, ylim=None):
+    sizes = np.linspace(5, N, n_sizes).astype(int)
+    train_err = np.zeros((n_runs, n_sizes))
+    validation_err = np.zeros((n_runs, n_sizes))
+    for i in range(n_runs):
+        for j, size in enumerate(sizes):
+            xtrain, xtest, ytrain, ytest = train_test_split(
+                X, y, train_size=size, random_state=i)
+            # Train on a random subset of `size` points
+            model.fit(xtrain, ytrain)
+            validation_err[i, j] = err_func(ytest, model.predict(xtest))
+            train_err[i, j] = err_func(ytrain, model.predict(xtrain))
+
+    plt.plot(sizes, validation_err.mean(axis=0), lw=2, label='validation')
+    plt.plot(sizes, train_err.mean(axis=0), lw=2, label='training')
+
+    plt.xlabel('training set size')
+    plt.ylabel(err_func.__name__.replace('_', ' '))
+    
+    plt.grid(True)
+    
+    plt.legend(loc=0)
+    
+    plt.xlim(0, N-1)
+    
+    if ylim:
+        plt.ylim(ylim)
+
+
+plt.figure(figsize=(10, 8))
+for i, model in enumerate([Lasso(0.01), Ridge(0.06)]):
+    plt.subplot(221 + i)
+    plot_learning_curve(model, ylim=(0, 1))
+    plt.title(model.__class__.__name__)
+    
+    plt.subplot(223 + i)
+    plot_learning_curve(model, err_func=mean_squared_error, ylim=(0, 8000))
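
The learning-curve computation in this solution can also be separated from the plotting, which makes the numbers easier to inspect. The following is a minimal sketch under the same assumptions as before: Python 3 and a recent scikit-learn, with `train_test_split` imported from `sklearn.model_selection` instead of `sklearn.cross_validation`. The function name `learning_curve_errors` and the choice of 8 sizes and 5 runs are illustrative, not part of the commit.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split  # assumption: modern scikit-learn

X, y = load_diabetes(return_X_y=True)


def learning_curve_errors(model, X, y, sizes, n_runs=5):
    """Hypothetical helper: mean training and validation MSE per training-set size."""
    train_err = np.zeros((n_runs, len(sizes)))
    val_err = np.zeros((n_runs, len(sizes)))
    for i in range(n_runs):
        for j, size in enumerate(sizes):
            # Draw a random training subset of `size` points; the rest is validation
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, train_size=int(size), random_state=i)
            model.fit(X_tr, y_tr)
            train_err[i, j] = mean_squared_error(y_tr, model.predict(X_tr))
            val_err[i, j] = mean_squared_error(y_val, model.predict(X_val))
    # Average over the random splits
    return train_err.mean(axis=0), val_err.mean(axis=0)


sizes = np.linspace(20, 300, 8).astype(int)
train_mse, val_mse = learning_curve_errors(Ridge(alpha=0.06), X, y, sizes)

for size, tr, va in zip(sizes, train_mse, val_mse):
    print(size, "train MSE:", round(tr, 1), "validation MSE:", round(va, 1))
```

Comparing the two columns as the size grows is the same reading the plots support: a persistent gap between them suggests over-fitting, while two curves that converge at a high error suggest under-fitting.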