L1-norm and L2-norm regularization doc #3586
@@ -0,0 +1,8 @@
For more information, see:
* [Scaling Up Stochastic Dual Coordinate Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
* [Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

Check the See Also section for links to usage examples.
@@ -0,0 +1,27 @@
Review thread (resolved):
* Why is this in the l1-norm and l2-norm regularization include? #Resolved
* We are optimizing
* I will add. (In reply to: 278778511)

This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
[overfitting](https://en.wikipedia.org/wiki/Overfitting) may occur, so that
the trained model is good at describing the training data but may fail to predict correct results on unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
such a phenomenon by penalizing the magnitude (usually measured by a
[norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
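To make the penalty concrete, here is a minimal sketch of the regularized objective. The loss $L$, the regularizer $R$, and the coefficient $\lambda$ are generic symbols chosen for illustration, not notation taken from this page; $\textbf{w}$ stands for the stacked model parameters.

$$ \min_{\textbf{w}} \ \frac{1}{n} \sum_{i=1}^{n} L(\textbf{w}; \textbf{x}_i, y_i) + \lambda \, R(\textbf{w}) $$

Here $R(\textbf{w})$ is typically a norm such as $||\textbf{w}||_1$ or $||\textbf{w}||_2^2$, and a larger $\lambda$ means stronger penalization of large parameter values.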
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization),
which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
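Written out with two illustrative coefficients $\lambda_1$ and $\lambda_2$ (the text itself only speaks of "the coefficient of L1-norm" and of L2-norm), the elastic-net penalty added to the training objective for each weight vector is

$$ \lambda_1 \, || \textbf{w}_c ||_1 + \lambda_2 \, || \textbf{w}_c ||_2^2, \qquad c = 1, \dots, m. $$

Setting $\lambda_1 = 0$ leaves a pure ridge (L2) penalty, and setting $\lambda_2 = 0$ leaves a pure LASSO (L1) penalty.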
Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimensional and sparse data sets, if users carefully select the coefficient of L1-norm,
it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
(e.g., 1% of total model weights) without affecting its prediction power.
In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
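A rough way to see why the L1-norm produces exact zeros (a generic proximal-operator argument, not a description of this trainer's exact update): coordinate-wise minimization of an L1-penalized objective applies a soft-threshold, $w_j \leftarrow \mathrm{sign}(z_j)\max(|z_j| - \lambda_1, 0)$, which maps any coordinate whose unpenalized value $z_j$ is smaller than $\lambda_1$ in magnitude to exactly zero, whereas the L2 penalty only rescales $z_j$ and never zeroes it.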
Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
[Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
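This prior-distribution view can be made slightly more explicit (a standard maximum a posteriori argument, assumed here rather than stated on this page): up to additive constants, the negative log-density of a Laplace prior contributes an L1 term, while that of a zero-mean Gaussian prior contributes a squared L2 term,

$$ -\log p_{\text{Laplace}}(\textbf{w}) = \lambda_1 ||\textbf{w}||_1 + \text{const}, \qquad -\log p_{\text{Gauss}}(\textbf{w}) = \lambda_2 ||\textbf{w}||_2^2 + \text{const}, $$

so maximizing the posterior under these priors is the same as minimizing the regularized empirical risk sketched above.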
An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
can harm predictive capacity by excluding important variables from the model.
For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
Therefore, choosing the right regularization coefficients is important in practice.
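Where these coefficients are actually set depends on the trainer. As a minimal, hedged sketch, the following assumes the ML.NET `Microsoft.ML` package, the `SdcaLogisticRegression` extension method, and its `l1Regularization`/`l2Regularization` parameters; the toy data and coefficient values are made up purely for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Minimal sketch: a binary SDCA trainer with explicit L1-norm and L2-norm coefficients.
public static class SdcaRegularizationSketch
{
    // Toy schema; a realistic data set would have far more rows and features.
    public class DataPoint
    {
        [VectorType(3)]
        public float[] Features { get; set; }
        public bool Label { get; set; }
    }

    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new DataPoint { Features = new float[] { 1f, 0f, 2f }, Label = true  },
            new DataPoint { Features = new float[] { 0f, 1f, 0f }, Label = false },
            new DataPoint { Features = new float[] { 2f, 0f, 1f }, Label = true  },
            new DataPoint { Features = new float[] { 0f, 2f, 0f }, Label = false },
        });

        // A larger l1Regularization drives more weights to exactly zero (a sparser model);
        // a larger l2Regularization shrinks all weights but does not zero them out.
        var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features",
            l1Regularization: 0.01f,
            l2Regularization: 0.1f);

        var model = trainer.Fit(data);
        Console.WriteLine("Trained SDCA model with elastic-net style L1 + L2 penalties.");
    }
}
```

In practice the two values above would be tuned (for example by cross-validation), in line with the note about choosing coefficients carefully.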
@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
     /// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
     ///
     /// ### Training Algorithm Details
-    /// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
-    /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
+    /// The optimization algorithm is an extension of [a coordinate descent method](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
+    /// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
+    /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
+    /// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
     ///
-    /// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
-    /// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
-    /// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
-    /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
-    ///
-    /// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
-    /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
-    /// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
-    /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
-    /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
-    ///
-    /// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
-    /// Therefore, choosing the right regularization coefficients is important in practice.
+    /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
Review thread on this include:
* Why are we including this in the base class definition for multiclass and in the derived classes for binary classification? #ByDesign
* It's a common behavior shared by all derived classes. #Resolved
* But it is inconsistent between the multiclass and binary ... #Resolved
* The reason is that they are written by different persons. Multiclass' XML doc is referenced in derived classes' XML docs, so there is no difference actually. I honestly don't have much time for writing style. (In reply to: 278777687)
     ///
     /// Check the See Also section for links to usage examples.
     /// ]]>
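Since this XML doc describes the multiclass SDCA family, a hedged configuration sketch may help connect the two penalties to that API. It assumes the `SdcaMaximumEntropy` trainer and its `SdcaMaximumEntropyMulticlassTrainer.Options` type with `L1Regularization`/`L2Regularization` properties; the column names and values are illustrative only.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Illustrative configuration: the multiclass SDCA trainer with explicit
// elastic-net coefficients, set through its Options type.
var mlContext = new MLContext(seed: 0);

var options = new SdcaMaximumEntropyMulticlassTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    // Coefficient of the L1-norm term; larger values give sparser weights.
    L1Regularization = 0.01f,
    // Coefficient of the L2-norm term; larger values shrink weights more strongly.
    L2Regularization = 0.1f
};

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(options);

// In a full pipeline the label is first mapped to a key type, for example:
// var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label").Append(trainer);
```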
Review thread:
* You don't need this file. You can just move it to the end of algo-details.sdca.md. It's OK that regularization details come after this, because it will be a separate section. #ByDesign
* It looks super strange... Regularization is the reason why SDCA can exist. As you may know, SDCA solves the "dual form" of the original optimization problem. Without regularization, that dual form may not exist. Consequently, users should learn about regularization before jumping into the details of SDCA. (In reply to: 284914930)