L1-norm and L2-norm regularization doc #3586

Merged
merged 12 commits into from
May 28, 2019
L1-norm and L2-norm regularization doc
wschin committed Apr 25, 2019
commit 1cb88a0370d6e52cb345a09487dd1849d497a1ac
40 changes: 16 additions & 24 deletions docs/api-reference/algo-details-sdca.md
@@ -28,35 +28,27 @@ non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one set 'Shuffle' to False and
'NumThreads' to 1.
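
A minimal sketch of that configuration (not part of the original text; it assumes the ML.NET option names `Shuffle` and `NumberOfThreads` on the SDCA multiclass trainer options, and placeholder column names):

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext(seed: 0);

// Turning off shuffling and using a single thread makes SDCA's solution
// reproducible from run to run, at the cost of some training speed.
var options = new SdcaMaximumEntropyMulticlassTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    Shuffle = false,
    NumberOfThreads = 1
};

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(options);
```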

This learner supports [elastic net
regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a
linear combination of the L1 and L2 penalties of the [lasso and ridge
methods](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net).
It can be specified by the 'L2Const' and 'L1Threshold' parameters. Note that the
'L2Const' has an effect on the number of needed training iterations. In general,
the larger the 'L2Const', the fewer iterations are needed to achieve a
reasonable solution. Regularization is a method that can render an ill-posed
problem more tractable and prevents overfitting by penalizing the model's
magnitude, usually measured by some norm function. This can improve the generalization of
the model learned by selecting the optimal complexity in the [bias-variance
tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
Regularization works by adding the penalty that is associated with coefficient
values to the error of the hypothesis. An accurate model with extreme
coefficient values would be penalized more, but a less accurate model with more
conservative values would be penalized less.
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
@shmoradims shmoradims Apr 25, 2019

This class use empricial risk minimization to formulat [](start = 0, length = 115)

please limit the line width, so that it's easy to review without going left and right. it's also a good practice for viewing the file on github. #Resolved

If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
[overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by a [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
@shmoradims shmoradims Apr 25, 2019

measureed [](start = 167, length = 9)

typo: measured #Resolved

This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
@shmoradims shmoradims Apr 25, 2019

penalizing [](start = 115, length = 10)

penalizes #Resolved

L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
@shmoradims shmoradims Apr 25, 2019

add blank line for new paragraph. #Resolved

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
@shmoradims shmoradims Apr 25, 2019

$\textbf{w}_1,\dots,\textbf{w}_m$ [](start = 110, length = 33)

let's call this something like, model weights, or model parameters, and not repeat it below. #Resolved

For a high-dimensional and sparse data set, if the user carefully selects the coefficient of the L1-norm, it is possible to achieve a good prediction quality with a model that has only a few non-zero weights (e.g., 1% of weights) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its prediction power.
@shmoradims shmoradims Apr 25, 2019

. [](start = 258, length = 2)

missing a word #Resolved

@shmoradims shmoradims Apr 25, 2019

dimention [](start = 9, length = 9)

dimension

please use spell checker plugin #Resolved

Member Author
@wschin wschin Apr 25, 2019

It doesn't show anything.. It was ok yesterday. Let me try Vim's.


In reply to: 278724688 [](ancestors = 278724688)

@shmoradims shmoradims Apr 25, 2019

you're right, the plugin is useless in markdown sections. not sure if there's an option to make it look at those sections. You might be able to temporarily remove the CDATA tags, and see if typos show up like the way they do in

#Resolved

@shmoradims shmoradims Apr 25, 2019

with a few of non-zeros [](start = 158, length = 23)

that has only a few non-zero weights #Resolved

@shmoradims shmoradims Apr 25, 2019

values [](start = 192, length = 6)

1% of weights #Resolved

In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
using L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
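
To make the two penalties explicit, the regularized training objective can be sketched as below, where $\hat{L}$ denotes the empirical risk on the training data and $\lambda_1$, $\lambda_2$ (illustrative symbols, not used elsewhere in this document) are the coefficients of the L1-norm and L2-norm terms:

$$
\min_{\textbf{w}_1,\dots,\textbf{w}_m} \hat{L}(\textbf{w}_1,\dots,\textbf{w}_m) + \lambda_1 \sum_{c=1}^{m} || \textbf{w}_c ||_1 + \lambda_2 \sum_{c=1}^{m} || \textbf{w}_c ||_2^2
$$

Setting $\lambda_2 = 0$ leaves pure L1-norm (LASSO) regularization, setting $\lambda_1 = 0$ leaves pure L2-norm (ridge) regularization, and elastic net uses a combination of both.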

L1-norm and L2-norm regularizations have different effects and uses that are
complementary in certain respects. Using L1-norm can increase sparsity of the
trained $\textbf{w}$. When working with high-dimensional data, it shrinks small
weights of irrelevant features to 0 and therefore no resource will be spent on
those bad features when making predictions. L2-norm regularization is preferable
for data that is not sparse and it largely penalizes the existence of large
weights.
An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
Therefore, choosing the right regularization coefficients is important in practice.
For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
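
As a rough illustration of where these coefficients are set (a sketch assuming the ML.NET trainer catalog's `SdcaMaximumEntropy` extension with optional `l1Regularization` and `l2Regularization` parameters; the numeric values are placeholders only):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Illustrative coefficients only: larger values regularize more aggressively
// and, as noted above, can underfit by forcing useful weights to zero.
var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.01f,  // L1-norm coefficient: encourages sparse model weights
    l2Regularization: 0.1f);  // L2-norm coefficient: shrinks all weights toward zero
```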

For more information, see:
* [Scaling Up Stochastic Dual Coordinate
Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
* [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

Check the See Also section for links to examples of the usage.
21 changes: 12 additions & 9 deletions src/Microsoft.ML.StandardTrainers/Standard/SdcaMulticlass.cs
@@ -62,19 +62,22 @@ namespace Microsoft.ML.Trainers
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
///
/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing the model's magnitude, usually measured by some norm function.
/// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
/// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimization problem built upon collected data.
Member
@ganik ganik Apr 25, 2019

optimized [](start = 129, length = 9)

optimization ? #Resolved

/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
/// [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen, so that the trained model is good at describing training data but may fail to predict correct results in unseen events.
Member
@ganik ganik Apr 25, 2019

so [](start = 76, length = 2)

when #Resolved

Member Author

It should be so that.


In reply to: 278724124 [](ancestors = 278724124)

/// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measured by a [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization), which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making predictions.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
/// Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
Member
@ganik ganik Apr 25, 2019

Togehter [](start = 8, length = 8)

typo #Resolved

Member
@ganik ganik Apr 25, 2019

Togehter [](start = 8, length = 8)

uses #Pending

Member Author

I guess you mean users.


In reply to: 278723627 [](ancestors = 278723627)

/// For a high-dimensional and sparse data set, if the user carefully selects the coefficient of the L1-norm, it is possible to achieve a good prediction quality with a model that has only a few non-zero weights (e.g., 1% of weights) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its prediction power.
/// In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
/// Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of the L1-norm and L2-norm.
/// Note that conceptually, using L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
/// using L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
///
/// Check the See Also section for links to usage examples.
/// ]]>