L1-norm and L2-norm regularization doc #3586
Conversation
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
measureed [](start = 167, length = 9)
typo: measured #Resolved
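As a side note for readers of this thread, the penalized objective that the quoted doc text describes can be sketched as follows; the loss $L$, the sample count $N$, and the coefficients $\lambda_1, \lambda_2$ are notation introduced only for this illustration, not symbols from the doc itself:

$$\min_{\textbf{w}_1,\dots,\textbf{w}_m} \; \frac{1}{N}\sum_{i=1}^{N} L\big(f(\textbf{x}_i; \textbf{w}_1,\dots,\textbf{w}_m),\, y_i\big) \;+\; \lambda_1 \sum_{c=1}^{m} \|\textbf{w}_c\|_1 \;+\; \lambda_2 \sum_{c=1}^{m} \|\textbf{w}_c\|_2^2$$

With $\lambda_1 = \lambda_2 = 0$ this is plain empirical risk minimization; increasing either coefficient trades fit on the training data for smaller (L2) or sparser (L1) weights.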
If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
penalizing [](start = 115, length = 10)
penalizes #Resolved
/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
/// Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
Togehter [](start = 8, length = 8)
typo #Resolved
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
. [](start = 258, length = 2)
missing a word #Resolved
/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
/// Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
Togehter [](start = 8, length = 8)
uses #Pending
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
optimized [](start = 129, length = 9)
optimization ? #Resolved
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
/// (overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
so [](start = 76, length = 2)
when #Resolved
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
dimention [](start = 9, length = 9)
dimension
please use spell checker plugin #Resolved
It doesn't show anything.. It was ok yesterday. Let me try Vim's.
In reply to: 278724688 [](ancestors = 278724688)
You're right, the plugin is useless in markdown sections. Not sure if there's an option to make it look at those sections. You might be able to temporarily remove the CDATA tags and see if typos show up the way they do in
#Resolved
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
with a few of non-zeros [](start = 158, length = 23)
that has only a few non-zero weights #Resolved
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
values [](start = 192, length = 6)
1% of weights #Resolved
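To ground the discussion of picking the L1-norm and L2-norm coefficients, here is a minimal ML.NET sketch that sets the two coefficients on the SDCA binary trainer. It is not the doc's own sample: the data class, column names, and the values 0.01/0.1 are assumptions made for illustration, and the exact trainer overload may vary across ML.NET versions.

```csharp
// Minimal sketch: where the elastic-net coefficients discussed above are set on SDCA.
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class DataPoint
{
    public bool Label { get; set; }

    [VectorType(3)]
    public float[] Features { get; set; }
}

public static class SdcaRegularizationSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Tiny in-memory dataset, purely for illustration.
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new DataPoint { Label = true,  Features = new float[] { 1f, 0f, 2f } },
            new DataPoint { Label = false, Features = new float[] { 0f, 2f, 1f } },
            new DataPoint { Label = true,  Features = new float[] { 2f, 1f, 3f } },
            new DataPoint { Label = false, Features = new float[] { 0f, 3f, 0f } },
        });

        // l1Regularization encourages exact zeros in the weights (sparsity);
        // l2Regularization shrinks large weights without zeroing them out.
        var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features",
            l1Regularization: 0.01f,
            l2Regularization: 0.1f);

        var model = trainer.Fit(data);
        Console.WriteLine("Trained SDCA model with elastic-net regularization.");
    }
}
```

In practice both coefficients are tuned, for example by cross-validation; making either one too large is the "aggressive regularization" warned about elsewhere in this review.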
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
add blank line for new paragraph. #Resolved
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
$\textbf{w}_1,\dots,\textbf{w}_m$ [](start = 110, length = 33)
let's call this something like, model weights, or model parameters, and not repeat it below. #Resolved
values to the error of the hypothesis. An accurate model with extreme
coefficient values would be penalized more, but a less accurate model with more
conservative values would be penalized less.
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
This class use empricial risk minimization to formulat [](start = 0, length = 115)
please limit the line width, so that it's easy to review without going left and right. it's also a good practice for viewing the file on github. #Resolved
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
at least need $n$ [](start = 67, length = 17)
need at least
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
such a phenomenon by penalizing the magnitude (usually measured by
[norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
) [](start = 107, length = 1)
Add comma after ) (in general whenever the next word is "which") #Resolved
Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine tune the coefficients of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
[Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
L2-norm means that a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
means that a [](start = 8, length = 12)
"assumes a Gaussian distribution" or "implies a Gaussian distribution" #Resolved
L2-norm means that a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.

An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
can harm predictive capacity by excluding important variables out of the model.
out of the model [](start = 62, length = 16)
from the model #Resolved
@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
(http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) [](start = 54, length = 73)
Maybe give this link some display text [like this] ? #Resolved
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
/// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
/// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
data set [](start = 117, length = 8)
data sets #Resolved
to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so that
(overfitting) [](start = 0, length = 13)
square brackets [overfitting] #Resolved
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimension and sparse data set, if users carefully select the coefficient of L1-norm,
high-dimension [](start = 4, length = 14)
"high-dimensional" perhaps? #Resolved
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimension and sparse data set, if users carefully select the coefficient of L1-norm,
data set [](start = 30, length = 8)
data sets #Resolved
@@ -0,0 +1,27 @@
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
Why is this in the l1-norm and l2-norm regularization include? #Resolved
We are optimizing regularized ERM. ER is also known as the loss function. #Resolved
I will add "Note that empirical risk is usually measured by applying a loss function on the model's predictions on collected data points."
In reply to: 278778511 [](ancestors = 278778511)
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
Why are we including this in the base class definition for multiclass and in the derived classes for binary classification? #ByDesign
It's a common behavior shared by all derived classes. #Resolved
But it is inconsistent between the multiclass and binary ... #Resolved
The reason is that they are written by different persons. Multiclass' XML doc is referenced in derived classes' XML docs, so there is no difference actually. I honestly don't have much time for writing style.
In reply to: 278777687 [](ancestors = 278777687)
Co-Authored-By: wschin <[email protected]>
Co-Authored-By: wschin <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #3586 +/- ##
=========================================
Coverage ? 72.76%
=========================================
Files ? 808
Lines ? 145458
Branches ? 16244
=========================================
Hits ? 105844
Misses ? 35191
Partials ? 4423
@@ -0,0 +1,27 @@
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
It's not clear how this connects with l1-norm and l2-norm regularization. Let me know if you want to discuss this offline. #ByDesign
ERM ---> without enough data ---> overfit ---> use regularization to overcome overfit.
In reply to: 279009294 [](ancestors = 279009294)
@@ -0,0 +1,27 @@
This class uses [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) (i.e., ERM)
This [](start = 0, length = 5)
Please add a header '### Regularization' so that the following text becomes a separate section. Also move it after all the algo details.
#ByDesign
No. I don't only mean regularization. It is a brief introduction to the whole optimization problem.
In reply to: 284914354 [](ancestors = 284914354)
Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)

Check the See Also section for links to examples of the usage.
you don't need this file. you can just move it to the end of algo-details.sdca.md. It's ok that regularization details come after this, b/c it will be a separate section. #ByDesign
It looks super strange... Regularization is the reason why SDCA can exist. As you may know, SDCA solves the "dual form" of the original optimization problem. Without regularization, that dual form may not exist. Consequently, users should learn about regularization before jumping into the details of SDCA.
In reply to: 284914930 [](ancestors = 284914930)
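To make the dual-form point concrete, here is a minimal restatement of the SDCA setup from the linked Shalev-Shwartz and Zhang paper, for the L2-regularized case; $\phi_i$ is the per-example loss, $\phi_i^*$ its convex conjugate, and the notation is restated here only for illustration:

$$P(\textbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(\textbf{w}^\top \textbf{x}_i) + \frac{\lambda}{2}\|\textbf{w}\|_2^2, \qquad D(\boldsymbol{\alpha}) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i \textbf{x}_i\Big\|_2^2.$$

The strongly convex $\frac{\lambda}{2}\|\textbf{w}\|_2^2$ term is what makes the dual well defined and lets SDCA optimize one dual variable $\alpha_i$ at a time, which is why regularization is introduced before the algorithm details.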
Fix #3356.