L1-norm and L2-norm regularization doc #3586
@@ -28,35 +28,27 @@ non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one set 'Shuffle' to False and
'NumThreads' to 1.
This learner supports [elastic net
regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a
linear combination of the L1 and L2 penalties of the [lasso and ridge
methods](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net).
It can be specified by the 'L2Const' and 'L1Threshold' parameters. Note that
'L2Const' has an effect on the number of training iterations needed. In general,
the larger the 'L2Const', the fewer iterations are needed to achieve a
reasonable solution. Regularization is a method that can render an ill-posed
problem more tractable and prevents overfitting by penalizing the model's
magnitude, usually measured by some norm function. This can improve the
generalization of the learned model by selecting the optimal complexity in the
[bias-variance
tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
Regularization works by adding the penalty that is associated with coefficient
values to the error of the hypothesis. An accurate model with extreme
coefficient values would be penalized more, but a less accurate model with more
conservative values would be penalized less.
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
[overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by the [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
typo: measured #Resolved
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
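As a sketch of how these pieces fit together (the loss $L$, prediction function $f$, and coefficients $\lambda_1$ and $\lambda_2$ below are generic symbols, not the trainer's option names), the regularized empirical risk being minimized can be written as

$$
\min_{\textbf{w}_1,\dots,\textbf{w}_m} \frac{1}{n} \sum_{i=1}^{n} L\big(f(\textbf{x}_i), y_i\big) + \lambda_1 \sum_{c=1}^{m} || \textbf{w}_c ||_1 + \lambda_2 \sum_{c=1}^{m} || \textbf{w}_c ||_2^2
$$

where $n$ is the number of training examples and $\lambda_1, \lambda_2 \ge 0$ control the strength of the L1-norm and L2-norm penalties, respectively.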
penalizes #Resolved
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
add blank line for new paragraph. #Resolved
Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
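One way to see why an L1-norm penalty produces exact zeros is the soft-thresholding step that coordinate-wise solvers commonly apply; the update below is a generic sketch with a placeholder threshold $\lambda$, not the exact rule used by this trainer:

$$
\operatorname{prox}_{\lambda || \cdot ||_1}(v)_j = \operatorname{sign}(v_j) \max(|v_j| - \lambda, 0)
$$

Any coordinate whose magnitude does not exceed the threshold is set exactly to zero, which is what makes the learned weights sparse.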
let's call this something like, model weights, or model parameters, and not repeat it below. #Resolved
For a high-dimensional and sparse data set, if users carefully select the coefficient of the L1-norm, it is possible to achieve a good prediction quality with a model that has only a few non-zero weights (e.g., 1% of weights) without affecting its predictive power.
missing a word #Resolved
dimension, please use spell checker plugin #Resolved
It doesn't show anything. It was ok yesterday. Let me try Vim's.
you're right, the plugin is useless in markdown sections. not sure if there's an option to make it look at those sections. You might be able to temporarily remove the CDATA tags, and see if typos show up like the way they do in #Resolved
that has only a few non-zero weights #Resolved
1% of weights #Resolved
In contrast, the L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
Sometimes, using the L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
Note that conceptually, using the L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), while
using the L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
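This correspondence can be made precise through maximum a posteriori (MAP) estimation; the sketch below uses generic scale parameters $b$ and $\sigma$ that are not exposed by the trainer:

$$
-\log p(\textbf{w}) = \frac{1}{b} || \textbf{w} ||_1 + \text{const} \;\; \text{(i.i.d. Laplace prior)}, \qquad
-\log p(\textbf{w}) = \frac{1}{2\sigma^2} || \textbf{w} ||_2^2 + \text{const} \;\; \text{(i.i.d. Gaussian prior)}
$$

so adding an L1-norm or L2-norm penalty to the negative log-likelihood is equivalent to placing the corresponding prior on the parameters and maximizing the posterior, with the penalty coefficient playing the role of $1/b$ or $1/(2\sigma^2)$.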
L1-norm and L2-norm regularizations have different effects and uses that are
complementary in certain respects. Using the L1-norm can increase the sparsity of the
trained $\textbf{w}$. When working with high-dimensional data, it shrinks the small
weights of irrelevant features to 0, so no resources are spent on
those features when making predictions. L2-norm regularization is preferable
for data that is not sparse, and it largely penalizes the existence of large
weights.
An aggressive regularization (that is, assigning large coefficients to the L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
Therefore, choosing the right regularization coefficients is important in practice.
For example, a very large L1-norm coefficient may force all parameters to be zero and produce a trivial model.
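As an illustration of setting these coefficients, the sketch below assumes the ML.NET catalog API in which the SDCA maximum entropy trainer exposes `l1Regularization` and `l2Regularization` parameters; the column names and coefficient values are placeholders, and the exact parameter names may differ between versions.

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Construct the trainer with explicit regularization coefficients.
// Larger coefficients mean a more aggressive penalty; a very large
// l1Regularization value can force all weights to zero and yield a trivial model.
var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.01f,  // coefficient of the L1-norm (sparsity) penalty
    l2Regularization: 0.1f);  // coefficient of the L2-norm (weight shrinkage) penalty

// The trainer is then appended to a pipeline and fitted as usual, e.g.:
// var model = dataPipeline.Append(trainer).Fit(trainingData);
```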
For more information, see:
* [Scaling Up Stochastic Dual Coordinate Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
* [Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

Check the See Also section for links to examples of the usage.
@@ -62,19 +62,22 @@ namespace Microsoft.ML.Trainers
/// The optimization algorithm is an extension of [stochastic dual coordinate ascent](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
///
/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data, and it prevents overfitting by penalizing the model's magnitude, usually measured by some norm function.
/// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
/// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
optimization ? #Resolved
/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
/// [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing training data but may fail to predict correct results in unseen events.
when #Resolved
/// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by the [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Using the L1-norm can increase the sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks the small weights of irrelevant features to 0, so no resources are spent on those features when making predictions.
/// L2-norm regularization is preferable for data that is not sparse, and it largely penalizes the existence of large weights.
/// Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
typo #Resolved
typo #Resolved
uses #Pending
/// For a high-dimensional and sparse data set, if users carefully select the coefficient of the L1-norm, it is possible to achieve a good prediction quality with a model that has only a few non-zero weights (e.g., 1% of weights) without affecting its predictive power.
/// In contrast, the L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
/// Sometimes, using the L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
/// Note that conceptually, using the L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution), while
/// using the L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
///
/// An aggressive regularization (that is, assigning large coefficients to the L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// For example, a very large L1-norm coefficient may force all parameters to be zero and produce a trivial model.
///
/// Check the See Also section for links to usage examples.
/// ]]>
please limit the line width, so that it's easy to review without going left and right. it's also a good practice for viewing the file on github. #Resolved