
Commit a1b5eaa

L1-norm and L2-norm regularization doc (dotnet#3586)
1 parent b6bf1fb commit a1b5eaa

7 files changed: +54 −62 lines changed
docs/api-reference/algo-details-sdca-refs.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+ For more information, see:
+ * [Scaling Up Stochastic Dual Coordinate
+ Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
+ * [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
+ Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
+
+
+ Check the See Also section for links to examples of the usage.

docs/api-reference/algo-details-sdca.md

Lines changed: 0 additions & 33 deletions
@@ -27,36 +27,3 @@ and therefore everyone eventually reaches the same place. Even in
non-strongly-convex cases, you will get equally-good solutions from run to run.
For reproducible results, it is recommended that one sets 'Shuffle' to False and
'NumThreads' to 1.
-
- This learner supports [elastic net
- regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a
- linear combination of the L1 and L2 penalties of the [lasso and ridge
- methods](https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net).
- It can be specified by the 'L2Const' and 'L1Threshold' parameters. Note that the
- 'L2Const' has an effect on the number of needed training iterations. In general,
- the larger the 'L2Const', the less number of iterations is needed to achieve a
- reasonable solution. Regularization is a method that can render an ill-posed
- problem more tractable and prevents overfitting by penalizing model's magnitude
- usually measured by some norm functions. This can improve the generalization of
- the model learned by selecting the optimal complexity in the [bias-variance
- tradeoff](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
- Regularization works by adding the penalty that is associated with coefficient
- values to the error of the hypothesis. An accurate model with extreme
- coefficient values would be penalized more, but a less accurate model with more
- conservative values would be penalized less.
-
- L1-nrom and L2-norm regularizations have different effects and uses that are
- complementary in certain respects. Using L1-norm can increase sparsity of the
- trained $\textbf{w}$. When working with high-dimensional data, it shrinks small
- weights of irrevalent features to 0 and therefore no reource will be spent on
- those bad features when making prediction. L2-norm regularization is preferable
- for data that is not sparse and it largely penalizes the existence of large
- weights.
-
- For more information, see:
- * [Scaling Up Stochastic Dual Coordinate
- Ascent.](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/main-3.pdf)
- * [Stochastic Dual Coordinate Ascent Methods for Regularized Loss
- Minimization.](http://www.jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
-
- Check the See Also section for links to examples of the usage.
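The removed paragraph above configured the elastic-net penalty through the trainer's 'L2Const', 'L1Threshold', 'Shuffle', and 'NumThreads' settings. The following C# sketch is illustrative only and not part of this commit; it assumes the renamed option surface of recent ML.NET releases (`L1Regularization`, `L2Regularization`, `Shuffle`, and `NumberOfThreads` on `SdcaLogisticRegressionBinaryTrainer.Options`) as the counterparts of those settings.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Sketch, not from the commit: option names are assumed equivalents of the
// 'L1Threshold'/'L2Const'/'Shuffle'/'NumThreads' settings mentioned above.
var mlContext = new MLContext(seed: 0);

var options = new SdcaLogisticRegressionBinaryTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    L1Regularization = 0.01f,  // L1 coefficient (sparsity-inducing penalty)
    L2Regularization = 0.1f,   // L2 coefficient; larger values generally need fewer iterations
    Shuffle = false,           // recommended for reproducible results
    NumberOfThreads = 1        // recommended for reproducible results
};

var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(options);
```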
docs/api-reference/regularization-l1-l2.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+ This class uses [empirical risk minimization](https://en.wikipedia.org/wiki/Empirical_risk_minimization) (i.e., ERM)
+ to formulate the optimization problem built upon collected data.
+ Note that empirical risk is usually measured by applying a loss function on the model's predictions on collected data points.
+ If the training data does not contain enough data points
+ (for example, to train a linear model in $n$-dimensional space, we need at least $n$ data points),
+ [overfitting](https://en.wikipedia.org/wiki/Overfitting) may happen so that
+ the model produced by ERM is good at describing training data but may fail to predict correct results in unseen events.
+ [Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) is a common technique to alleviate
+ such a phenomenon by penalizing the magnitude (usually measured by the
+ [norm function](https://en.wikipedia.org/wiki/Norm_(mathematics))) of model parameters.
+ This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization),
+ which penalizes a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations for $c=1,\dots,m$.
+ L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
+
+ Together with the implemented optimization algorithm, L1-norm regularization can increase the sparsity of the model weights, $\textbf{w}_1,\dots,\textbf{w}_m$.
+ For high-dimensional and sparse data sets, if users carefully select the coefficient of L1-norm,
+ it is possible to achieve a good prediction quality with a model that has only a few non-zero weights
+ (e.g., 1% of total model weights) without affecting its prediction power.
+ In contrast, L2-norm cannot increase the sparsity of the trained model but can still prevent overfitting by avoiding large parameter values.
+ Sometimes, using L2-norm leads to a better prediction quality, so users may still want to try it and fine-tune the coefficients of L1-norm and L2-norm.
+ Note that conceptually, using L1-norm implies that the distribution of all model parameters is a
+ [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution) while
+ L2-norm implies a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) for them.
+
+ An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms)
+ can harm predictive capacity by excluding important variables from the model.
+ For example, a very large L1-norm coefficient may force all parameters to be zeros and lead to a trivial model.
+ Therefore, choosing the right regularization coefficients is important in practice.
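To make the objective concrete, the elastic-net-regularized ERM problem described in this new file can be written as below. Here $L$ denotes the per-example loss, $n$ the number of data points, and $\lambda_1, \lambda_2 \ge 0$ the user-chosen L1-norm and L2-norm coefficients; these symbols are introduced here for illustration and do not appear in the file itself.

$$
\min_{\textbf{w}_1,\dots,\textbf{w}_m} \; \frac{1}{n}\sum_{i=1}^{n} L\big(\textbf{w}_1,\dots,\textbf{w}_m;\, x_i, y_i\big) \;+\; \lambda_1 \sum_{c=1}^{m} \|\textbf{w}_c\|_1 \;+\; \lambda_2 \sum_{c=1}^{m} \|\textbf{w}_c\|_2^2
$$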

src/Microsoft.ML.StandardTrainers/Standard/LogisticRegression/MulticlassLogisticRegression.cs

Lines changed: 1 addition & 14 deletions
@@ -76,20 +76,7 @@ namespace Microsoft.ML.Trainers
/// Since L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with a high-dimensional feature vector.
/// The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation of the Hessian matrix but also a higher computation cost per step.
///
- /// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data. It prevents overfitting by penalizing the model's magnitude, usually measured by some norm functions.
- /// This can improve the generalization of the model learned by selecting the optimal complexity in the [bias-variance trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
- /// Regularization works by adding a penalty that is associated with coefficient values to the error of the hypothesis.
- /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
- ///
- /// This learner supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w} ||_1$, and L2-norm (ridge), $|| \textbf{w} ||_2^2$ regularizations.
- /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
- /// Using L1-norm can increase sparsity of the trained $\textbf{w}$.
- /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
- /// If L1-norm regularization is used, the training algorithm used would be [OWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
- /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
- ///
- /// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
- /// Therefore, choosing the right regularization coefficients is important when applying maximum entropy classifier.
+ /// [!include[io](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// Check the See Also section for links to usage examples.
/// ]]>
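For context, here is a hedged C# sketch of constructing the L-BFGS maximum entropy (multiclass logistic regression) trainer documented in this file, with both penalties set. The `LbfgsMaximumEntropy` extension method and its `l1Regularization`/`l2Regularization` parameters are assumptions based on the ML.NET 1.x catalog and are not shown in the diff.

```csharp
using Microsoft.ML;

// Sketch, not part of the commit: method and parameter names are assumed
// from the ML.NET 1.x API surface.
var mlContext = new MLContext();

var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
    labelColumnName: "Label",
    featureColumnName: "Features",
    l1Regularization: 0.01f,   // a non-zero L1 penalty switches training to OWL-QN
    l2Regularization: 0.1f);
```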

src/Microsoft.ML.StandardTrainers/Standard/SdcaBinary.cs

Lines changed: 8 additions & 0 deletions
@@ -1566,6 +1566,10 @@ private protected override BinaryPredictionTransformer<TModelParameters> MakeTra
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
@@ -1652,6 +1656,10 @@ private protected override SchemaShape.Column[] ComputeSdcaBinaryClassifierSchem
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
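A short end-to-end sketch of the binary SDCA trainer whose remarks are edited above, using the simple overload with explicit elastic-net coefficients. The `SdcaLogisticRegression` overload and its `l1Regularization`/`l2Regularization` parameters are assumed from the ML.NET 1.x catalog, and the data is fabricated for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

class SdcaElasticNetExample
{
    // Tiny made-up schema: a boolean label and a two-dimensional feature vector.
    private class DataPoint
    {
        public bool Label { get; set; }
        [VectorType(2)]
        public float[] Features { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new DataPoint { Label = true,  Features = new float[] { 1.0f, 2.0f } },
            new DataPoint { Label = false, Features = new float[] { 3.0f, 0.5f } },
            new DataPoint { Label = true,  Features = new float[] { 1.2f, 1.8f } },
            new DataPoint { Label = false, Features = new float[] { 2.8f, 0.7f } },
        });

        // Elastic net: both coefficients are optional; leaving them null lets
        // the trainer pick its defaults.
        var pipeline = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features",
            l2Regularization: 0.1f,
            l1Regularization: 0.01f);

        var model = pipeline.Fit(data);
        Console.WriteLine("Training finished.");
    }
}
```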

src/Microsoft.ML.StandardTrainers/Standard/SdcaMulticlass.cs

Lines changed: 5 additions & 15 deletions
@@ -59,22 +59,12 @@ namespace Microsoft.ML.Trainers
/// In other cases, the output score vector is just $[\hat{y}^1, \dots, \hat{y}^m]$.
///
/// ### Training Algorithm Details
- /// The optimization algorithm is an extension of (http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf) following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
- /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data set.
+ /// The optimization algorithm is an extension of [a coordinate descent method](http://jmlr.org/papers/volume14/shalev-shwartz13a/shalev-shwartz13a.pdf)
+ /// following a similar path proposed in an earlier [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf).
+ /// It is usually much faster than [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) and
+ /// [truncated Newton methods](https://en.wikipedia.org/wiki/Truncated_Newton_method) for large-scale and sparse data sets.
///
- /// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
- /// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance tradeoff.
- /// Regularization works by adding the penalty on the magnitude of $\textbf{w}_c$, $c=1,\dots,m$ to the error of the hypothesis.
- /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
- ///
- /// This trainer supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w}_c ||_1$, and L2-norm (ridge), $|| \textbf{w}_c ||_2^2$ regularizations.
- /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
- /// Using L1-norm can increase sparsity of the trained $\textbf{w}_c$.
- /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
- /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
- ///
- /// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
- /// Therefore, choosing the right regularization coefficients is important in practice.
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
///
/// Check the See Also section for links to usage examples.
/// ]]>
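Similarly, a hedged sketch of the multiclass SDCA maximum entropy trainer documented here; the `SdcaMaximumEntropy` extension method and its parameter names are assumptions based on the ML.NET 1.x catalog, not taken from the diff. The label is mapped to a key type first, since multiclass trainers expect a key-typed label column.

```csharp
using Microsoft.ML;

// Sketch, not part of the commit: method and parameter names are assumed
// from the ML.NET 1.x catalog.
var mlContext = new MLContext();

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(
        labelColumnName: "Label",
        featureColumnName: "Features",
        l2Regularization: 0.1f,
        l1Regularization: 0.01f));
```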

src/Microsoft.ML.StandardTrainers/Standard/SdcaRegression.cs

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ namespace Microsoft.ML.Trainers
/// | Required NuGet in addition to Microsoft.ML | None |
///
/// [!include[algorithm](~/../docs/samples/docs/api-reference/algo-details-sdca.md)]
+ ///
+ /// [!include[regularization](~/../docs/samples/docs/api-reference/regularization-l1-l2.md)]
+ ///
+ /// [!include[references](~/../docs/samples/docs/api-reference/algo-details-sdca-refs.md)]
/// ]]>
/// </format>
/// </remarks>
