
L1-norm and L2-norm regularization doc #3586


Merged
merged 12 commits on May 28, 2019
Address comments
wschin committed Apr 25, 2019
commit ad3d98e9f9acfe4f2f032b14a51334a367ee11f6
35 changes: 23 additions & 12 deletions docs/api-reference/algo-details-sdca.md
@@ -28,22 +28,33 @@ non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one sets 'Shuffle' to False and
'NumThreads' to 1.
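
As a minimal sketch of that recommendation (assuming the ML.NET 1.x API, where these settings surface as the `Shuffle` and `NumberOfThreads` fields of `SdcaMaximumEntropyMulticlassTrainer.Options`; the toy data class, values, and column names below are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

public static class SdcaReproducibilitySketch
{
    // Toy schema; "Label" and "Features" are the ML.NET default column names.
    private class DataPoint
    {
        public string Label { get; set; }

        [VectorType(3)]
        public float[] Features { get; set; }
    }

    public static void Main()
    {
        // A fixed seed plus Shuffle = false and NumberOfThreads = 1 makes SDCA runs repeatable.
        var mlContext = new MLContext(seed: 0);

        var data = mlContext.Data.LoadFromEnumerable(new List<DataPoint>
        {
            new DataPoint { Label = "a", Features = new[] { 1f, 0f, 0f } },
            new DataPoint { Label = "b", Features = new[] { 0f, 1f, 0f } },
            new DataPoint { Label = "c", Features = new[] { 0f, 0f, 1f } },
            new DataPoint { Label = "a", Features = new[] { 0.9f, 0.1f, 0f } },
            new DataPoint { Label = "b", Features = new[] { 0f, 0.9f, 0.1f } },
            new DataPoint { Label = "c", Features = new[] { 0.1f, 0f, 0.9f } },
        });

        var options = new SdcaMaximumEntropyMulticlassTrainer.Options
        {
            Shuffle = false,      // 'Shuffle' = False in the text above
            NumberOfThreads = 1   // 'NumThreads' = 1 in the text above
        };

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(options));

        var model = pipeline.Fit(data);
        Console.WriteLine("Model trained.");
    }
}
```

With shuffling disabled and a single thread, repeated `Fit` calls on the same data should produce the same model.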

This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
[overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen so that
the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
such a phenomenon by penalizing the magnitude (usually measured by
[norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
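As a sketch of the resulting training objective (with $N$ for the number of training examples, $L$ for the per-example loss, and $\lambda_1$, $\lambda_2$ for the L1-norm and L2-norm coefficients; these symbols are stand-ins used only here), the trainer approximately solves

$$\min_{\textbf{w}_1,\dots,\textbf{w}_m} \frac{1}{N} \sum_{i=1}^{N} L(\textbf{w}_1,\dots,\textbf{w}_m; \textbf{x}_i, y_i) + \lambda_1 \sum_{c=1}^{m} || \textbf{w}_c ||_1 + \lambda_2 \sum_{c=1}^{m} || \textbf{w}_c ||_2^2.$$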
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

@shmoradims shmoradims Apr 25, 2019

add blank line for new paragraph. #Resolved

Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
In contrast, L2-norm can not increase the sparsity of the trained model but can still prevernt overfitting by avoiding large parameter values.
Sometimes, using L2-norm leads to a better prediction quality, so user may still want try it and fine tune the coefficints of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimensional and sparse data sets, if users carefully select the coefficient of the L1-norm,
it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
(e.g., 1% of total model weights) without affecting its prediction power.
In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
[Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
using L2-norm implies that they follow a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution).
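
As a small follow-on sketch (reusing `mlContext` from the snippet above; the coefficient values are illustrative, and the `SdcaMaximumEntropy` overload with `l1Regularization` and `l2Regularization` parameters is assumed from the ML.NET 1.x API), the two coefficients can be set directly when creating the trainer:

```csharp
// Larger l1Regularization pushes more entries of w_1,...,w_m to exactly zero (a sparser model);
// l2Regularization shrinks all weights toward zero without zeroing them, which guards against overfitting.
var elasticNetTrainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.01f,   // illustrative value only; tune on validation data
    l2Regularization: 0.1f);   // illustrative value only; tune on validation data
```

Sweeping these two values, for example with cross-validation, is the usual way to trade sparsity against prediction quality.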

An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
Therefore, choosing the right regularization coefficients is important in practice.
An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
can harm predictive capacity by excluding important variables from the model.
For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
Therefore, choosing the right regularization coefficients is important in practice.

For more information, see:
* [Scaling Up Stochastic Dual Coordinate
41 changes: 27 additions & 14 deletions src/Microsoft.ML.StandardTrainers/Standard/SdcaMulticlass.cs
@@ -59,25 +59,38 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
/// The optimization algorithm is an extension of [SDCA](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

@najeeb-kazmi najeeb-kazmi Apr 25, 2019

(http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) [](start = 54, length = 73)

Maybe give this link some display text [like this] ? #Resolved

/// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
/// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.

@najeeb-kazmi najeeb-kazmi Apr 25, 2019

data set [](start = 117, length = 8)

data sets #Resolved

///
/// This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
/// (overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
/// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
/// to formulate the optimization problem built upon collected data.
/// If the training data does not contain enough data points
/// (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
/// [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen so that
/// the trained model is good at describing training data but may fail to predict correct results in unseen events.
/// [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
/// such a phenomenon by penalizing the magnitude (usually measured by
/// [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
/// which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
/// For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
/// In contrast, L2-norm can not increase the sparsity of the trained model but can still prevernt overfitting by avoiding large parameter values.
/// Sometimes, using L2-norm leads to a better prediction quality, so user may still want try it and fine tune the coefficints of L1-norm and L2-norm.
/// Note that conceptually, using L1-norm implies that the distribution of all model parameters is a [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
///
/// Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
/// For high-dimensional and sparse data sets, if users carefully select the coefficient of the L1-norm,
/// it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
/// (e.g., 1% of total model weights) without affecting its prediction power.
/// In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
/// Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
/// Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
/// [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
/// using L2-norm implies that they follow a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution).
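/// As a sketch of that correspondence (with $\lambda_1$ and $\lambda_2$ used here only as stand-ins for the L1-norm and L2-norm coefficients),
/// if the training loss is read as a negative log-likelihood, adding the penalties amounts to maximizing the posterior under those priors, since, up to constants,
/// $-\log p_{\text{Laplace}}(\textbf{w}_c) = \lambda_1 || \textbf{w}_c ||_1$ and $-\log p_{\text{Gaussian}}(\textbf{w}_c) = \lambda_2 || \textbf{w}_c ||_2^2$.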
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
/// can harm predictive capacity by excluding important variables from the model.
/// For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
/// Therefore, choosing the right regularization coefficients is important in practice.
///
/// Check the See Also section for links to usage examples.
/// ]]>