
Commit a1b5eaa

L1-norm and L2-norm regularization doc (dotnet#3586)
1 parent b6bf1fb commit a1b5eaa

7 files changed: +54 −62 lines changed
docs/api-reference/algo-details-sdca-refs.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+ For more information, see:
+ * [Scaling Up Stochastic Dual Coordinate
+ Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
+ * [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
+ Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
+
+
+ Check the See Also section for links to examples of the usage.

docs/api-reference/algo-details-sdca.md

Lines changed: 0 additions & 33 deletions
@@ -27,36 +27,3 @@ and therefore everyone eventually reaches the same place. Even in
non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one sets 'Shuffle' to False and
'NumThreads' to 1.
-
- This learner supports [elastic net
- regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a
- linear combination of the L1 and L2 penalties of the [lasso and ridge
- methods](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net).
- It can be specified by the 'L2Const' and 'L1Threshold' parameters. Note that the
- 'L2Const' has an effect on the number of needed training iterations. In general,
- the larger the 'L2Const', the less number of iterations is needed to achieve a
- reasonable solution. Regularization is a method that can render an ill-posed
- problem more tractable and prevents overfitting by penalizing model's magnitude
- usually measured by some norm functions. This can improve the generalization of
- the model learned by selecting the optimal complexity in the [bias-variance
- tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
- Regularization works by adding the penalty that is associated with coefficient
- values to the error of the hypothesis. An accurate model with extreme
- coefficient values would be penalized more, but a less accurate model with more
- conservative values would be penalized less.
-
- L1-nrom and L2-norm regularizations have different effects and uses that are
- complementary in certain respects. Using L1-norm can increase sparsity of the
- trained $\textbf{w}$. When working with high-dimensional data, it shrinks small
- weights of irrevalent features to 0 and therefore no reource will be spent on
- those bad features when making prediction. L2-norm regularization is preferable
- for data that is not sparse and it largely penalizes the existence of large
- weights.
-
- For more information, see:
- * [Scaling Up Stochastic Dual Coordinate
- Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
- * [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
- Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
-
- Check the See Also section for links to examples of the usage.
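The removed paragraph above configured the elastic-net penalty through the trainer's 'L2Const', 'L1Threshold', 'Shuffle', and 'NumThreads' settings. The following C# sketch is illustrative only and not part of this commit; it assumes the renamed option surface of recent ML.NET releases (`L1Regularization`, `L2Regularization`, `Shuffle`, and `NumberOfThreads` on `SdcaLogisticRegressionBinaryTrainer.Options`) as the counterparts of those settings.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Sketch, not from the commit: option names are assumed equivalents of the
// 'L1Threshold'/'L2Const'/'Shuffle'/'NumThreads' settings mentioned above.
var mlContext = new MLContext(seed: 0);

var options = new SdcaLogisticRegressionBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    L1Regularization = 0.01f,  // L1 coefficient (sparsity-inducing penalty)
    L2Regularization = 0.1f,   // L2 coefficient; larger values generally need fewer iterations
    Shuffle = false,           // recommended for reproducible results
    NumberOfThreads = 1        // recommended for reproducible results
};

var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);
```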
docs/api-reference/regularization-l1-l2.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+ This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) (i.e., ERM)
+ to formulate the optimization problem built upon collected data.
+ Note that empirical risk is usually measured by applying a loss function on the model's predictions on collected data points.
+ If the training data does not contain enough data points
+ (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
+ [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen so that
+ the model produced by ERM is good at describing training data but may fail to predict correct results in unseen events.
+ [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
+ such a phenomenon by penalizing the magnitude (usually measured by the
+ [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
+ This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization),
+ which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
+ L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
+
+ Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
+ For high-dimensional and sparse data sets, if users carefully select the coefficient of L1-norm,
+ it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
+ (e.g., 1% of total model weights) without affecting its prediction power.
+ In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
+ Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
+ Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
+ [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
+ L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
+
+ An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
+ can harm predictive capacity by excluding important variables from the model.
+ For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
+ Therefore, choosing the right regularization coefficients is important in practice.
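To make the objective concrete, the elastic-net-regularized ERM problem described in this new file can be written as below. Here $L$ denotes the per-example loss, $n$ the number of data points, and $\lambda_1, \lambda_2 \ge 0$ the user-chosen L1-norm and L2-norm coefficients; these symbols are introduced here for illustration and do not appear in the file itself.

$$
\min_{\textbf{w}_1,\dots,\textbf{w}_m} \; \frac{1}{n}\sum_{i=1}^{n} L\big(\textbf{w}_1,\dots,\textbf{w}_m;\, x_i, y_i\big) \;+\; \lambda_1 \sum_{c=1}^{m} \|\textbf{w}_c\|_1 \;+\; \lambda_2 \sum_{c=1}^{m} \|\textbf{w}_c\|_2^2
$$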

src/Microsoft.ML.StandardTrainers/Standard/LogisticRegression/MulticlassLogisticRegression.cs

Lines changed: 1 addition & 14 deletions
@@ -76,20 +76,7 @@ namespace Microsoft.ML.Trainers
/// Since L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with a high-dimensional feature vector.
/// The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation of the Hessian matrix but also a higher computation cost per step.
///
- /// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data. It prevents overfitting by penalizing the model's magnitude, usually measured by some norm functions.
- /// This can improve the generalization of the model learned by selecting the optimal complexity in the [bias-variance trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
- /// Regularization works by adding a penalty that is associated with coefficient values to the error of the hypothesis.
- /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
- ///
- /// This learner supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w} ||_1$, and L2-norm (ridge), $|| \textbf{w} ||_2^2$ regularizations.
- /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
- /// Using L1-norm can increase sparsity of the trained $\textbf{w}$.
- /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
- /// If L1-norm regularization is used, the training algorithm used would be [OWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
- /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
- ///
- /// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
- /// Therefore, choosing the right regularization coefficients is important when applying maximum entropy classifier.
+ /// [!include[io](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// Check the See Also section for links to usage examples.
/// ]]>
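For context, here is a hedged C# sketch of constructing the L-BFGS maximum entropy (multiclass logistic regression) trainer documented in this file, with both penalties set. The `LbfgsMaximumEntropy` extension method and its `l1Regularization`/`l2Regularization` parameters are assumptions based on the ML.NET 1.x catalog and are not shown in the diff.

```csharp
using Microsoft.ML;

// Sketch, not part of the commit: method and parameter names are assumed
// from the ML.NET 1.x API surface.
var mlContext = new MLContext();

var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.01f,   // a non-zero L1 penalty switches training to OWL-QN
    l2Regularization: 0.1f);
```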

src/Microsoft.ML.StandardTrainers/Standard/SdcaBinary.cs

Lines changed: 8 additions & 0 deletions
@@ -1566,6 +1566,10 @@ private protected override BinaryPredictionTransformer<TModelParameters> MakeTra
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
@@ -1652,6 +1656,10 @@ private protected override SchemaShape.Column[] ComputeSdcaBinaryClassifierSchem
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
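A short end-to-end sketch of the binary SDCA trainer whose remarks are edited above, using the simple overload with explicit elastic-net coefficients. The `SdcaLogisticRegression` overload and its `l1Regularization`/`l2Regularization` parameters are assumed from the ML.NET 1.x catalog, and the data is fabricated for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class SdcaElasticNetExample
{
    // Tiny made-up schema: a boolean label and a two-dimensional feature vector.
    private class DataPoint
    {
        public bool Label { get; set; }
        [VectorType(2)]
        public float[] Features { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new DataPoint { Label = true,  Features = new float[] { 1.0f, 2.0f } },
            new DataPoint { Label = false, Features = new float[] { 3.0f, 0.5f } },
            new DataPoint { Label = true,  Features = new float[] { 1.2f, 1.8f } },
            new DataPoint { Label = false, Features = new float[] { 2.8f, 0.7f } },
        });

        // Elastic net: both coefficients are optional; leaving them null lets
        // the trainer pick its defaults.
        var pipeline = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features",
            l2Regularization: 0.1f,
            l1Regularization: 0.01f);

        var model = pipeline.Fit(data);
        Console.WriteLine("Training finished.");
    }
}
```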

src/Microsoft.ML.StandardTrainers/Standard/SdcaMulticlass.cs

Lines changed: 5 additions & 15 deletions
@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
- /// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
- /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
+ /// The optimization algorithm is an extension of [a coordinate descent method](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
+ /// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
+ /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
+ /// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
///
- /// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
- /// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
- /// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
- /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
- ///
- /// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
- /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
- /// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
- /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
- /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
- ///
- /// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
- /// Therefore, choosing the right regularization coefficients is important in practice.
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// Check the See Also section for links to usage examples.
/// ]]>
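Similarly, a hedged sketch of the multiclass SDCA maximum entropy trainer documented here; the `SdcaMaximumEntropy` extension method and its parameter names are assumptions based on the ML.NET 1.x catalog, not taken from the diff. The label is mapped to a key type first, since multiclass trainers expect a key-typed label column.

```csharp
using Microsoft.ML;

// Sketch, not part of the commit: method and parameter names are assumed
// from the ML.NET 1.x catalog.
var mlContext = new MLContext();

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
        labelColumnName: "Label",
        featureColumnName: "Features",
        l2Regularization: 0.1f,
        l1Regularization: 0.01f));
```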

src/Microsoft.ML.StandardTrainers/Standard/SdcaRegression.cs

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ namespace Microsoft.ML.Trainers
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
