
Commit a43b274

committed
syntax
1 parent e463573 commit a43b274

File tree

2 files changed: +117 -86 lines changed


README.ipynb

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regularization methods\n",
"## Shrinkage methods for restricting regression models\n",
"\n",
"There are two reasons why we are often not satisfied with the least squares\n",
"estimates.\n",
"\n",
"* The first is *prediction accuracy*: the least squares estimates often have\n",
"low bias but large variance. Prediction accuracy can sometimes be\n",
"improved by shrinking or setting some coefficients to zero. By doing\n",
"so we sacrifice a little bit of bias to reduce the variance of the predicted\n",
"values, and hence may improve the overall prediction accuracy.\n",
"\n",
"* The second reason is *interpretation*. With a large number of predictors,\n",
"we often would like to determine a smaller subset that exhibits\n",
"the strongest effects. In order to get the “big picture,” we are willing\n",
"to sacrifice some of the small details.\n",
"\n",
"Some of the most commonly used methods for model restriction include *Forward- and Backward-Stepwise Selection* and *Stagewise regression*. By retaining a subset of the predictors and discarding the rest, subset selection\n",
"produces a model that is interpretable and has possibly lower prediction\n",
"error than the full model. However, because it is a discrete process—\n",
"variables are either retained or discarded—it often exhibits high variance,\n",
"and so doesn’t reduce the prediction error of the full model. *Shrinkage*\n",
"*methods* (namely Ridge, Lasso and Elastic Net) are more continuous and don’t suffer as much from high\n",
"variability.\n",
"\n",
"As a continuous shrinkage method, Ridge regression achieves its better prediction performance through a bias–variance trade-off. However, ridge regression cannot produce a parsimonious model, for it always keeps all the predictors in the model.\n",
"\n",
"A promising technique called the Lasso was proposed by Tibshirani (1996). The lasso is a penalized least squares method imposing an L1-penalty on the regression coefficients. Owing to the nature of the L1-penalty, the lasso does both continuous shrinkage and automatic variable selection simultaneously. On the other hand, the L1-penalty admits no closed-form solution in the non-orthonormal case, which makes the computation of lasso estimates a quadratic programming problem.\n",
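"\n",
"As a quick, hedged illustration of the contrast above, the sketch below fits Ridge, Lasso and Elastic Net on the same data and counts how many coefficients each model sets exactly to zero. The synthetic data, penalty strengths and scikit-learn calls are this note's own assumptions, not taken from the references.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.linear_model import Ridge, Lasso, ElasticNet\n",
"\n",
"# Synthetic data: 50 predictors, only 10 of them truly informative (assumed setup).\n",
"X, y = make_regression(n_samples=200, n_features=50, n_informative=10,\n",
"                       noise=5.0, random_state=0)\n",
"\n",
"models = {'ridge': Ridge(alpha=1.0),\n",
"          'lasso': Lasso(alpha=1.0),\n",
"          'elastic net': ElasticNet(alpha=1.0, l1_ratio=0.5)}\n",
"\n",
"for name, model in models.items():\n",
"    model.fit(X, y)\n",
"    # Ridge only shrinks coefficients; the L1-penalized models can zero some out.\n",
"    print(name, 'coefficients set to zero:', int(np.sum(model.coef_ == 0)))\n",
"```\n",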
"\n",
"## Lasso estimation algorithms\n",
"### The LARS Algorithm\n",
"\n",
"At the first step, LAR identifies the variable\n",
"most correlated with the response. Rather than fit this variable completely,\n",
"LAR moves the coefficient of this variable continuously toward its least squares\n",
"value (causing its correlation with the evolving residual to decrease\n",
"in absolute value). As soon as another variable “catches up” in terms of\n",
"correlation with the residual, the process is paused. The second variable\n",
"then joins the active set, and their coefficients are moved together in a way\n",
"that keeps their correlations tied and decreasing. This process is continued until all the variables are in the model, and ends at the full least-squares\n",
"fit.\n",
"\n",
"##### Naive LARS\n",
"\n",
"1. Standardize the predictors to have mean zero and unit norm. Start\n",
"with the residual $r = y - \\overline{y}$ and $\\beta_1, \\beta_2, \\ldots, \\beta_p = 0$\n",
"2. Find the predictor $x_j$ most correlated with $r$\n",
"3. Move $\\beta_j$ from 0 towards its least-squares coefficient $\\langle x_j, r\\rangle$ until some\n",
"other competitor $x_k$ has as much correlation with the current residual\n",
"as does $x_j$\n",
"4. Move $\\beta_j$ and $\\beta_k$ in the direction defined by their joint least squares\n",
"coefficient of the current residual on $(x_j, x_k)$, until some other competitor\n",
"$x_l$ has as much correlation with the current residual\n",
"5. Continue in this way until all $p$ predictors have been entered. After\n",
"$\\min(N - 1, p)$ steps, we arrive at the full least-squares solution.\n",
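"\n",
"A minimal sketch of the path these steps trace, assuming scikit-learn is available: `lars_path` with `method='lar'` returns the breakpoints of the piecewise-linear coefficient path together with the order in which predictors enter the active set (the diabetes data is used here only as a convenient stand-in).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import lars_path\n",
"\n",
"X, y = load_diabetes(return_X_y=True)\n",
"# Step 1: standardize predictors to mean zero and unit norm, center the response.\n",
"X = X - X.mean(axis=0)\n",
"X = X / np.sqrt((X ** 2).sum(axis=0))\n",
"y = y - y.mean()\n",
"\n",
"# method='lar' follows the steps above: one variable joins the active set per step.\n",
"alphas, active, coefs = lars_path(X, y, method='lar')\n",
"print('order of entry (column indices):', active)\n",
"print('coefficient path shape (p, number of breakpoints):', coefs.shape)\n",
"```\n",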
"\n",
"Hastie, Tibshirani & Friedman (2009) showed that the LAR path is almost identical to the lasso path, and the two differ only when a coefficient crosses zero. A simple modification of the LAR algorithm therefore gives the entire\n",
"lasso path, which is also piecewise-linear.\n",
"\n",
"##### Lasso modification\n",
"\n",
"* 4a. If a non-zero coefficient hits zero, drop its variable from the active set\n",
"of variables and recompute the current joint least squares direction.\n",
"\n",
"Together, these steps give an efficient way of computing the lasso path, with the same order of computation as the Cholesky or QR decompositions used for least-squares fitting.\n",
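"\n",
"A hedged sketch of this modification in practice: in scikit-learn the same routine computes both paths, so the LAR path and the lasso path can be compared directly; they coincide except where a coefficient crosses zero, in which case the lasso path (with step 4a) gains extra breakpoints where variables are dropped.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_diabetes\n",
"from sklearn.linear_model import lars_path\n",
"\n",
"X, y = load_diabetes(return_X_y=True)\n",
"\n",
"# Same algorithm, without and with the drop-at-zero modification (step 4a).\n",
"alphas_lar, active_lar, coefs_lar = lars_path(X, y, method='lar')\n",
"alphas_lasso, active_lasso, coefs_lasso = lars_path(X, y, method='lasso')\n",
"\n",
"print('LAR path breakpoints:  ', coefs_lar.shape[1])\n",
"print('lasso path breakpoints:', coefs_lasso.shape[1])\n",
"# Any extra lasso breakpoints mark steps where a coefficient hit zero and its\n",
"# variable was dropped from the active set before possibly re-entering later.\n",
"```\n",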
"\n",
"### Path-wise coordinate descent algorithm\n",
"\n",
"An alternative to the LARS algorithm for computing the lasso\n",
"solution is simple coordinate descent. This idea was proposed by Fu (1998)\n",
"and Daubechies et al. (2004), and later studied and generalized by Friedman\n",
"et al. (2007). The idea is to fix the penalty\n",
"parameter $\\lambda$ in the Lagrangian form and optimize successively over\n",
"each parameter, holding the other parameters fixed at their current values. This method is also called the “one-at-a-time”\n",
"coordinate-wise descent algorithm in the literature.\n",
"\n",
"A key point here: coordinate descent works so well because each coordinate-wise minimization can be done quickly,\n",
"and the relevant equations can be updated as we cycle through the variables. That makes it faster than the LARS algorithm, especially in large problems.\n",
"\n",
"By rearranging the lasso Lagrangian, the problem in each coordinate can be viewed as a univariate lasso problem with an explicit solution, resulting in the update\n",
"\n",
"$$\\hat\\beta_j(\\lambda) \\leftarrow S\\left(\\sum_{i=1}^{N} x_{ij}(y_i - \\hat y_i^{(j)}),\\; \\lambda\\right)$$\n",
"\n",
"Here $S(\\hat\\beta, \\lambda) = \\operatorname{sign}(\\hat\\beta)\\,(|\\hat\\beta| - \\lambda)_+$ is the so-called soft-thresholding operator. The first argument to $S(\\cdot)$ is the simple least-squares coefficient\n",
"of the partial residual on the standardized variable $x_{ij}$. Repeated iteration\n",
"of the update above—cycling through each variable in turn until convergence—yields\n",
"the lasso estimate $\\hat\\beta(\\lambda)$.\n",
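"\n",
"The update above translates almost line for line into code. Below is a minimal NumPy sketch under the text's assumptions (centered response, predictor columns scaled to unit norm, objective equal to one half the residual sum of squares plus $\\lambda$ times the L1 norm). The comparison against scikit-learn's coordinate-descent `Lasso`, using the rescaled penalty $\\alpha = \\lambda/N$, is only a sanity check added here, not part of the original text.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import Lasso\n",
"\n",
"def soft_threshold(b, lam):\n",
"    # S(b, lam) = sign(b) * (|b| - lam)_+\n",
"    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)\n",
"\n",
"def lasso_coordinate_descent(X, y, lam, n_sweeps=200):\n",
"    # Assumes the columns of X are centered and scaled to unit L2 norm, y centered.\n",
"    n, p = X.shape\n",
"    beta = np.zeros(p)\n",
"    r = y - X @ beta                   # full residual\n",
"    for _ in range(n_sweeps):\n",
"        for j in range(p):\n",
"            r = r + X[:, j] * beta[j]  # partial residual y - yhat^(j)\n",
"            beta[j] = soft_threshold(X[:, j] @ r, lam)\n",
"            r = r - X[:, j] * beta[j]  # back to the full residual\n",
"    return beta\n",
"\n",
"rng = np.random.default_rng(0)\n",
"n, p = 100, 8\n",
"X = rng.standard_normal((n, p))\n",
"X = X - X.mean(axis=0)\n",
"X = X / np.sqrt((X ** 2).sum(axis=0))\n",
"y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(n)\n",
"y = y - y.mean()\n",
"\n",
"lam = 0.5\n",
"beta_cd = lasso_coordinate_descent(X, y, lam)\n",
"# scikit-learn minimizes (1/(2n)) * RSS + alpha * L1, so alpha = lam / n matches.\n",
"beta_sk = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_\n",
"print('coordinate descent:', np.round(beta_cd, 3))\n",
"print('sklearn Lasso:     ', np.round(beta_sk, 3))\n",
"```"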
]
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:tf]",
"language": "python",
"name": "conda-env-tf-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

README.md

Lines changed: 0 additions & 86 deletions
This file was deleted.
