
L1-norm and L2-norm regularization doc #3586


Merged: 12 commits, May 28, 2019
8 changes: 8 additions & 0 deletions docs/api-reference/algo-details-sdca-refs.md
@@ -0,0 +1,8 @@
For more information, see:
* [Scaling Up Stochastic Dual Coordinate
Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
* [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)


Check the See Also section for links to usage examples.

@shmoradims May 16, 2019


You don't need this file; you can just move it to the end of algo-details-sdca.md. It's OK that the regularization details come after this, because it will be a separate section. #ByDesign

Member Author


It looks super strange... Regularization is the reason why SDCA can exist. As you may know, SDCA solves the dual form of the original optimization problem. Without regularization, that dual form may not exist. Consequently, users should learn about regularization before jumping into the details of SDCA.


In reply to: 284914930 [](ancestors = 284914930)
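
For reference (not part of this diff), the primal-dual pair from the cited SDCA paper by Shalev-Shwartz and Zhang makes the point concrete; $\phi_i$, its conjugate $\phi_i^*$, and the L2 coefficient $\lambda$ follow that paper's notation rather than anything defined in this PR:

$$
P(\textbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(\textbf{w}^T \textbf{x}_i) + \frac{\lambda}{2}\|\textbf{w}\|_2^2, \qquad
D(\boldsymbol{\alpha}) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i \textbf{x}_i\right\|_2^2
$$

The primal weights are recovered as $\textbf{w} = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i \textbf{x}_i$; without the L2 term ($\lambda > 0$) this mapping, and with it the dual coordinate ascent, is not available.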

33 changes: 0 additions & 33 deletions docs/api-reference/algo-details-sdca.md
@@ -27,36 +27,3 @@ and therefore everyone eventually reaches the same place. Even in
non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one sets 'Shuffle' to False and
'NumThreads' to 1.

This learner supports [elastic net
regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a
linear combination of the L1 and L2 penalties of the [lasso and ridge
methods](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net).
It can be specified by the 'L2Const' and 'L1Threshold' parameters. Note that the
'L2Const' has an effect on the number of needed training iterations. In general,
the larger the 'L2Const', the fewer iterations are needed to achieve a
reasonable solution. Regularization is a method that can render an ill-posed
problem more tractable and prevents overfitting by penalizing model's magnitude
usually measured by some norm functions. This can improve the generalization of
the model learned by selecting the optimal complexity in the [bias-variance
tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
Regularization works by adding the penalty that is associated with coefficient
values to the error of the hypothesis. An accurate model with extreme
coefficient values would be penalized more, but a less accurate model with more
conservative values would be penalized less.

L1-norm and L2-norm regularizations have different effects and uses that are
complementary in certain respects. Using L1-norm can increase sparsity of the
trained $\textbf{w}$. When working with high-dimensional data, it shrinks small
weights of irrelevant features to 0 and therefore no resource will be spent on
those bad features when making prediction. L2-norm regularization is preferable
for data that is not sparse and it largely penalizes the existence of large
weights.

For more information, see:
* [Scaling Up Stochastic Dual Coordinate
Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
* [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

Check the See Also section for links to examples of the usage.
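
To make the notes above about 'Shuffle', 'NumThreads', 'L2Const', and 'L1Threshold' concrete, here is a minimal sketch against the public Microsoft.ML 1.x surface; the names SdcaLogisticRegressionBinaryTrainer.Options, Shuffle, NumberOfThreads, L2Regularization, and L1Regularization are assumed public equivalents of the internal names above, and the snippet is not part of this diff.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

var mlContext = new MLContext(seed: 0);

// Reproducible, elastic-net-regularized SDCA: disable shuffling, run single-threaded,
// and set the L2/L1 coefficients explicitly (a larger L2 value typically converges
// in fewer iterations, per the note above).
var options = new SdcaLogisticRegressionBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    Shuffle = false,
    NumberOfThreads = 1,
    L2Regularization = 0.1f,
    L1Regularization = 0.01f
};

var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);
```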
27 changes: 27 additions & 0 deletions docs/api-reference/regularization-l1-l2.md
@@ -0,0 +1,27 @@
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
Contributor

@natke Apr 26, 2019


Why is this in the l1-norm and l2-norm regularization include? #Resolved

Member Author

@wschin Apr 26, 2019


We are optimizing regularized ERM. ER is also known as the loss function. #Resolved

Member Author


I will add: "Note that empirical risk is usually measured by applying a loss function on the model's predictions on collected data points."


In reply to: 278778511 [](ancestors = 278778511)

to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
[overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen so that
the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
such a phenomenon by penalizing the magnitude (usually measured by
[norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization),
which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimensional and sparse data sets, if users carefully select the coefficient of L1-norm,
it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
(e.g., 1% of total model weights) without affecting its prediction power.
In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
[Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.

An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
can harm predictive capacity by excluding important variables from the model.
For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
Therefore, choosing the right regularization coefficients is important in practice.
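
As a summary of the file above (illustrative only, not part of the diff), the minimized quantity can be written as the empirical risk plus the elastic-net penalty, where $\lambda_1$ and $\lambda_2$ stand for the user-chosen L1-norm and L2-norm coefficients, $L$ for the loss function applied to the prediction $\hat{y}_i$, and $(\textbf{x}_i, y_i)$ for the $n$ training examples:

$$
\min_{\textbf{w}_1,\dots,\textbf{w}_m} \; \frac{1}{n}\sum_{i=1}^{n} L\big(\hat{y}_i, y_i\big) \;+\; \lambda_1 \sum_{c=1}^{m} \|\textbf{w}_c\|_1 \;+\; \lambda_2 \sum_{c=1}^{m} \|\textbf{w}_c\|_2^2
$$

Setting $\lambda_1 = 0$ gives a pure ridge penalty, $\lambda_2 = 0$ a pure LASSO penalty, and an aggressive choice of either coefficient pushes the solution toward the trivial model described in the last paragraph.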
@@ -76,20 +76,8 @@ namespace Microsoft.ML.Trainers
/// Since L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with a high-dimensional feature vector.
/// The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation to the Hessian matrix but also a higher computation cost per step.
///
/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
/// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance trade-off.
/// Regularization works by adding the penalty that is associated with coefficient values to the error of the hypothesis.
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
/// [!include[io](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// This learner supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w} ||_1$, and L2-norm (ridge), $|| \textbf{w} ||_2^2$ regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Using L1-norm can increase sparsity of the trained $\textbf{w}$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// If L1-norm regularization is used, the used training algorithm would be [QWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important when applying maximum entropy classifier.
///
/// Check the See Also section for links to usage examples.
/// ]]>
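A minimal usage sketch for the maximum entropy trainer documented in this file; the catalog method LbfgsMaximumEntropy and its l1Regularization/l2Regularization parameters are assumed from Microsoft.ML 1.x and are not part of this diff. Per the note above, a non-zero L1 coefficient is what selects the QWL-QN solver.

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Maximum entropy classifier with both penalties: the non-zero L1 coefficient
// encourages sparse weights, while the L2 coefficient keeps the remaining weights small.
var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.05f,
    l2Regularization: 0.1f);
```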
8 changes: 8 additions & 0 deletions src/Microsoft.ML.StandardTrainers/Standard/SdcaBinary.cs
@@ -1566,6 +1566,10 @@ private protected override BinaryPredictionTransformer<TModelParameters> MakeTra
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
///
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
@@ -1652,6 +1656,10 @@ private protected override SchemaShape.Column[] ComputeSdcaBinaryClassifierSchem
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
///
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
20 changes: 5 additions & 15 deletions src/Microsoft.ML.StandardTrainers/Standard/SdcaMulticlass.cs
@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
/// The optimization algorithm is an extension of [a coordinate descent method](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
/// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
/// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
///
/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
/// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
/// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
Contributor

@natke Apr 26, 2019


Why are we including this in the base class definition for multiclass and in the derived classes for binary classification? #ByDesign

Member Author

@wschin Apr 26, 2019


It's a common behavior shared by all derived classes. #Resolved

Contributor

@natke Apr 26, 2019


But it is inconsistent between the multiclass and binary ... #Resolved

Member Author


Why? You mean their doc contents are different?


In reply to: 279005984 [](ancestors = 279005984)

Member Author


The reason is that they were written by different people. The multiclass XML doc is referenced in the derived classes' XML docs, so there is actually no difference. I honestly don't have much time for writing style.


In reply to: 278777687 [](ancestors = 278777687)

///
/// Check the See Also section for links to usage examples.
/// ]]>
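For completeness, a minimal sketch of the multiclass SDCA trainer discussed in this thread; the catalog method SdcaMaximumEntropy and its l2Regularization/l1Regularization parameters are assumed from Microsoft.ML 1.x and are not part of this diff.

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Multiclass SDCA with explicit elastic-net coefficients: a larger L2 value tends to
// need fewer iterations, and a non-zero L1 value encourages sparse per-class weights.
var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l2Regularization: 0.1f,
    l1Regularization: 0.01f);
```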
4 changes: 4 additions & 0 deletions src/Microsoft.ML.StandardTrainers/Standard/SdcaRegression.cs
@@ -40,6 +40,10 @@ namespace Microsoft.ML.Trainers
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
///
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>