L1-norm and L2-norm regularization doc #3586


Merged
merged 12 commits into dotnet:master on May 28, 2019

Conversation

Member
@wschin wschin commented Apr 25, 2019

Fix #3356.

@wschin wschin requested review from codemzs, natke and shmoradims April 25, 2019 20:20
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
@shmoradims shmoradims Apr 25, 2019

measureed [](start = 167, length = 9)

typo: measured #Resolved

If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
@shmoradims shmoradims Apr 25, 2019

penalizing [](start = 115, length = 10)

penalizes #Resolved

/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
/// Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
Member
@ganik ganik Apr 25, 2019

Togehter [](start = 8, length = 8)

typo #Resolved

This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
@shmoradims shmoradims Apr 25, 2019

. [](start = 258, length = 2)

missing a word #Resolved

/// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
/// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
/// Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
Member
@ganik ganik Apr 25, 2019

Togehter [](start = 8, length = 8)

uses #Pending

Member Author

I guess you mean users.


In reply to: 278723627 [](ancestors = 278723627)

/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
Member
@ganik ganik Apr 25, 2019

optimized [](start = 129, length = 9)

optimization ? #Resolved
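
The combined objective these two quoted sentences describe can be written out in one line. This is only a sketch consistent with the doc's notation; the per-example loss $L$, the training pairs $(\textbf{x}_i, y_i)$, the sample count $n$, and the coefficients $\lambda_1, \lambda_2$ are placeholder symbols introduced here, not names taken from the PR:

```latex
% Regularized empirical risk minimization with an elastic-net penalty:
% the first term is the empirical risk (average loss over the n collected data points),
% \lambda_1 weights the L1-norm (LASSO) term and \lambda_2 the squared L2-norm (ridge) term.
\min_{\textbf{w}} \;
\frac{1}{n} \sum_{i=1}^{n} L\big(f(\textbf{x}_i; \textbf{w}),\, y_i\big)
\;+\; \lambda_1 \,\|\textbf{w}\|_1
\;+\; \lambda_2 \,\|\textbf{w}\|_2^2
```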

/// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
/// This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
/// If the training data does not contain enough data points (for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
/// (overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
Member
@ganik ganik Apr 25, 2019

so [](start = 76, length = 2)

when #Resolved

Member Author

It should be so that.


In reply to: 278724124 [](ancestors = 278724124)

This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
@shmoradims shmoradims Apr 25, 2019

dimention [](start = 9, length = 9)

dimension

please use spell checker plugin #Resolved

Member Author
@wschin wschin Apr 25, 2019

It doesn't show anything. It was OK yesterday. Let me try Vim's.


In reply to: 278724688 [](ancestors = 278724688)

@shmoradims shmoradims Apr 25, 2019

You're right, the plugin is useless in markdown sections. Not sure if there's an option to make it look at those sections. You might be able to temporarily remove the CDATA tags, and see if typos show up the way they do in

#Resolved

This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
@shmoradims shmoradims Apr 25, 2019

with a few of non-zeros [](start = 158, length = 23)

that has only a few non-zero weights #Resolved

This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimention and sparse data set, if user carefully select the coefficient of L1-norm, it is possible to achieve a good prediction quality with a model with a few of non-zeros (e.g., 1% values) in $\textbf{w}_1,\dots,\textbf{w}_m$ without affecting its .
@shmoradims shmoradims Apr 25, 2019

values [](start = 192, length = 6)

1% of weights #Resolved
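
A minimal sketch of how these coefficients surface in ML.NET's public API, assuming the 1.x SDCA catalog extension for binary classification; the schema, column names, and regularization values below are hypothetical and chosen only to illustrate favoring the L1-norm when sparsity is wanted:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical schema: a Boolean label and a high-dimensional, possibly sparse feature vector.
public class DataPoint
{
    public bool Label { get; set; }

    [VectorType(1000)]
    public float[] Features { get; set; }
}

public static class SparseSdcaSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Empty in-memory data just so the sketch compiles end to end;
        // a real run would load actual training examples here.
        var data = mlContext.Data.LoadFromEnumerable(Array.Empty<DataPoint>());

        // A comparatively large L1 coefficient pushes small, irrelevant weights to exactly zero,
        // while a small L2 coefficient keeps the surviving weights from growing too large.
        var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features",
            l1Regularization: 0.1f,    // hypothetical value; tune per data set
            l2Regularization: 0.01f);  // hypothetical value; tune per data set

        // trainer.Fit(data) would then produce a linear model whose weight vector is expected
        // to contain many exact zeros on high-dimensional, sparse data.
    }
}
```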

(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so the trained model is good at describing training data but may fail to predict correct results in unseen events.
[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
@shmoradims shmoradims Apr 25, 2019

add blank line for new paragraph. #Resolved

[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate such a phenomenon by penalizing the magnitude (usually measureed by [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) which penalizing a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
Togehter with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the $\textbf{w}_1,\dots,\textbf{w}_m$.
@shmoradims shmoradims Apr 25, 2019

$\textbf{w}_1,\dots,\textbf{w}_m$ [](start = 110, length = 33)

let's call this something like, model weights, or model parameters, and not repeat it below. #Resolved

values to the error of the hypothesis. An accurate model with extreme
coefficient values would be penalized more, but a less accurate model with more
conservative values would be penalized less.
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) to formulate the optimized problem built upon collected data.
@shmoradims shmoradims Apr 25, 2019

This class use empricial risk minimization to formulat [](start = 0, length = 115)

please limit the line width, so that it's easy to review without going left and right. it's also a good practice for viewing the file on github. #Resolved

@wschin wschin requested review from shmoradims and ganik and removed request for shmoradims April 25, 2019 21:35
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

at least need $n$ [](start = 67, length = 17)

need at least $n$ #Resolved
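
As a concrete reading of the "need at least $n$ data points" remark (an illustration added here, not text from the PR): with fewer examples than dimensions, fitting the training data exactly does not pin down a unique model, which is precisely the overfitting risk the next quoted sentence describes.

```latex
% With m training examples stacked as rows of X \in R^{m \times n} and m < n,
% any consistent system X w = y has an affine set of exact solutions of dimension
% n - rank(X) >= n - m > 0, so training error alone cannot choose among them;
% a penalty such as ||w||_1 or ||w||_2^2 is what breaks the tie.
\text{if } m < n \text{ and } X\textbf{w}^\star = \textbf{y} \text{ for some } \textbf{w}^\star,
\quad \dim\{\textbf{w} : X\textbf{w} = \textbf{y}\} \;=\; n - \operatorname{rank}(X) \;\ge\; n - m \;>\; 0 .
```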

[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
such a phenomenon by penalizing the magnitude (usually measured by
[norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization)
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

) [](start = 107, length = 1)

Add comma after ) (in general whenever the next word is "which") #Resolved

Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine tune the coefficients of L1-norm and L2-norm.
Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
[Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
L2-norm means that a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

means that a [](start = 8, length = 12)

"assumes a Gaussian distribution" or "implies a Gaussian distribution" #Resolved
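
The prior interpretation in this quoted paragraph can be made precise with a short maximum a posteriori (MAP) argument. This is a sketch only; $\lambda$ is a placeholder coefficient and normalizing constants are dropped:

```latex
% MAP estimation with an i.i.d. prior p(w_j) on each model weight:
%   \hat{w} = \arg\max_w p(w \mid data) = \arg\min_w [ -\log p(data \mid w) - \sum_j \log p(w_j) ].
% A Laplace prior p(w_j) \propto e^{-\lambda |w_j|} contributes \lambda ||w||_1 (L1-norm);
% a Gaussian prior p(w_j) \propto e^{-\lambda w_j^2} contributes \lambda ||w||_2^2 (L2-norm).
\hat{\textbf{w}}_{\text{Laplace}} = \arg\min_{\textbf{w}} \big[-\log p(\text{data} \mid \textbf{w})\big] + \lambda \|\textbf{w}\|_1 ,
\qquad
\hat{\textbf{w}}_{\text{Gaussian}} = \arg\min_{\textbf{w}} \big[-\log p(\text{data} \mid \textbf{w})\big] + \lambda \|\textbf{w}\|_2^2 .
```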

L2-norm means that a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.

An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
can harm predictive capacity by excluding important variables out of the model.
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

out of the model [](start = 62, length = 16)

from the model #Resolved

@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

(http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) [](start = 54, length = 73)

Maybe give this link some display text [like this] ? #Resolved

/// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
/// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
/// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
/// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

data set [](start = 117, length = 8)

data sets #Resolved
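
For completeness, the same two coefficients are exposed on the multiclass SDCA trainer discussed in this file. The snippet below is a sketch against the 1.x catalog API with hypothetical column names and values, not code from the PR:

```csharp
using Microsoft.ML;

public static class MulticlassSdcaSketch
{
    public static void Main() => BuildPipeline(new MLContext(seed: 0));

    public static void BuildPipeline(MLContext mlContext)
    {
        // SDCA maximum-entropy (multinomial logistic regression) trainer.
        // The label must be a key type, hence the MapValueToKey step.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
                labelColumnName: "Label",
                featureColumnName: "Features",
                l1Regularization: 0.05f,            // hypothetical value
                l2Regularization: 0.01f,            // hypothetical value
                maximumNumberOfIterations: 100));   // hypothetical value

        // pipeline.Fit(trainingData) would run the dual coordinate ascent optimizer
        // described in the quoted "Training Algorithm Details" section.
    }
}
```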

to formulate the optimization problem built upon collected data.
If the training data does not contain enough data points
(for example, to train a linear model in $n$-dimensional space, we at least need $n$ data points),
(overfitting)(https://en.wikipedia.org/wiki/Overfitting) may happen so that
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

(overfitting) [](start = 0, length = 13)

square brackets [overfitting] #Resolved

L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimension and sparse data set, if users carefully select the coefficient of L1-norm,
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

high-dimension [](start = 4, length = 14)

"high-dimensional" perhaps? #Resolved

L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.

Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
For high-dimension and sparse data set, if users carefully select the coefficient of L1-norm,
Member
@najeeb-kazmi najeeb-kazmi Apr 25, 2019

data set [](start = 30, length = 8)

data sets #Resolved

Member
@najeeb-kazmi najeeb-kazmi left a comment

:shipit:

@@ -0,0 +1,27 @@
This class use [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
Contributor
@natke natke Apr 26, 2019

Why is this in the l1-norm and l2-norm regularization include? #Resolved

Member Author
@wschin wschin Apr 26, 2019

We optimize regularized ERM. ER (empirical risk) is also known as the loss function. #Resolved

Member Author

I will add: "Note that empirical risk is usually measured by applying a loss function on the model's predictions on collected data points."


In reply to: 278778511 [](ancestors = 278778511)

///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
Contributor
@natke natke Apr 26, 2019

Why are we including this in the base class definition for multiclass and in the derived classes for binary classification? #ByDesign

Member Author
@wschin wschin Apr 26, 2019

It's a common behavior shared by all derived classes. #Resolved

Contributor
@natke natke Apr 26, 2019

But it is inconsistent between the multiclass and binary ... #Resolved

Member Author

Why? You mean their doc contents are different?


In reply to: 279005984 [](ancestors = 279005984)

Member Author

The reason is that they were written by different people. The multiclass trainer's XML doc is referenced in the derived classes' XML docs, so there is actually no difference. I honestly don't have much time for writing style.


In reply to: 278777687 [](ancestors = 278777687)

codecov bot commented Apr 26, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@51b10fc). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #3586   +/-   ##
=========================================
  Coverage          ?   72.76%           
=========================================
  Files             ?      808           
  Lines             ?   145458           
  Branches          ?    16244           
=========================================
  Hits              ?   105844           
  Misses            ?    35191           
  Partials          ?     4423
Flag Coverage Δ
#Debug 72.76% <ø> (?)
#production 68.27% <ø> (?)
#test 89.04% <ø> (?)
Impacted Files Coverage Δ
...LogisticRegression/MulticlassLogisticRegression.cs 67.61% <ø> (ø)
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 91.12% <ø> (ø)
...oft.ML.StandardTrainers/Standard/SdcaRegression.cs 95.83% <ø> (ø)
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 72.95% <ø> (ø)

@wschin wschin requested a review from natke April 26, 2019 15:30
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
/// Therefore, choosing the right regularization coefficients is important in practice.
/// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
Contributor
@natke natke Apr 26, 2019

But it is inconsistent between the multiclass and binary ... #Resolved

@@ -0,0 +1,27 @@
This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
Contributor
@natke natke Apr 26, 2019

It's not clear how this connects with l1-norm and l2-norm regularization. Let me know if you want to discuss this offline. #ByDesign

Member Author

ERM ---> without enough data ---> overfit ---> use regularization to overcome overfit.


In reply to: 279009294 [](ancestors = 279009294)

@wschin wschin requested a review from natke April 26, 2019 17:15
@@ -0,0 +1,27 @@
This class uses [empricial risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) (i.e., ERM)
@shmoradims shmoradims May 16, 2019

This [](start = 0, length = 5)

Please add a header '### Regularization' so that the following text becomes a separate section. Also move it after all the algo details.
#ByDesign

Member Author

No. I don't mean only regularization. It is a brief introduction to the whole optimization problem.


In reply to: 284914354 [](ancestors = 284914354)

Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)


Check the See Also section for links to examples of the usage.
@shmoradims shmoradims May 16, 2019

you don't need this file. you can just move it to the end of algo-details.sdca.md. It's ok that regularization details come after this, b/c it will be a separate section. #ByDesign

Member Author

It looks super strange... Regularization is the reason why SDCA can exist. As you may know, SDCA solves the "dual form" of the original optimization problem. Without regularization, that dual form may not exist. Consequently, users should learn about regularization before jumping into the details of SDCA.


In reply to: 284914930 [](ancestors = 284914930)
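
For readers who want the dual-form point spelled out, here is the standard primal/dual pair from the Shalev-Shwartz and Zhang paper linked in the doc, written as a sketch in that paper's notation (the symbols $\phi_i$, $\phi_i^*$, $\alpha_i$, $\lambda$, $n$ come from that paper, not from this PR). The L2 regularizer $\tfrac{\lambda}{2}\|\textbf{w}\|_2^2$ is what makes the dual well defined, which is the point being made above:

```latex
% Primal: regularized ERM with convex per-example losses \phi_i and an L2 penalty.
% Dual: one variable \alpha_i per example, where \phi_i^* is the convex conjugate of \phi_i.
% SDCA maximizes D(\alpha) one randomly chosen coordinate at a time and recovers the
% primal weights as w(\alpha) = \frac{1}{\lambda n} \sum_i \alpha_i x_i.
P(\textbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(\textbf{w}^{\top}\textbf{x}_i) + \frac{\lambda}{2}\|\textbf{w}\|_2^2 ,
\qquad
D(\boldsymbol{\alpha}) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^{*}(-\alpha_i)
 - \frac{\lambda}{2}\Big\|\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i \textbf{x}_i\Big\|_2^2 .
```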

@wschin wschin requested a review from shmoradims May 24, 2019 20:19
@shmoradims shmoradims left a comment

:shipit:

@wschin wschin merged commit a1b5eaa into dotnet:master May 28, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Successfully merging this pull request may close these issues.

L1 and L2 Regularization
5 participants